Abstract

Partially observable Markov decision processes (POMDPs) provide an elegant mathematical
framework for modeling complex decision and planning problems in stochastic
domains in which states of the system are observable only indirectly, via a set of imperfect 
or noisy observations. The modeling advantage of POMDPs, however, comes at a price — 
exact methods for solving them are computationally very expensive and thus applicable 
in practice only to very simple problems. We focus on efficient approximation (heuristic) 
methods that attempt to alleviate the computational problem and trade off accuracy for 
speed. We have two objectives here. First, we survey various approximation methods, 
analyze their properties and relations and provide some new insights into their differences. 
Second, we present a number of new approximation methods and novel refinements of existing
techniques. The theoretical results are supported by experiments on a problem from
the agent navigation domain. 

1. Introduction 

Making decisions in dynamic environments requires careful evaluation of the cost and ben- 
efits not only of the immediate action but also of choices we may have in the future. This 
evaluation becomes harder when the effects of actions are stochastic, so that we must pur- 
sue and evaluate many possible outcomes in parallel. Typically, the problem becomes more 
complex the further we look into the future. The situation becomes even worse when the 
outcomes we can observe are imperfect or unreliable indicators of the underlying process 
and special actions are needed to obtain more reliable information. Unfortunately, many 
real-world decision problems fall into this category. 

Consider, for example, a problem of patient management. The patient comes to the 
hospital with an initial set of complaints. Only rarely do these allow the physician (decision- 
maker) to diagnose the underlying disease with certainty, so that a number of disease options 
generally remain open after the initial evaluation. The physician has multiple choices in 
managing the patient. He/she can choose to do nothing (wait and see), order additional tests 
and learn more about the patient state and disease, or proceed to a more radical treatment 
(e.g. surgery). Making the right decision is not an easy task. The disease the patient suffers 
can progress over time and may become worse if the window of opportunity for a particular 
effective treatment is missed. On the other hand, selection of the wrong treatment may 
make the patient's condition worse, or may prevent applying the correct treatment later. 
The result of the treatment is typically non-deterministic and more outcomes are possible. 
In addition, both treatment and investigative choices come with different costs. Thus, in 
a course of patient management, the decision-maker must carefully evaluate the costs and
benefits of both current and future choices, as well as their interaction and ordering. Other 
decision problems with similar characteristics — complex temporal cost-benefit tradeoffs, 
stochasticity, and partial observability of the underlying controlled process — include robot 
navigation, target tracking, machine maintenance and replacement, and the like.

Sequential decision problems can be modeled as Markov decision processes (MDPs) 
(Bellman, 1957; Howard, 1960; Puterman, 1994; Boutilier, Dean, & Hanks, 1999) and their 
extensions. The model of choice for problems similar to patient management is the partially 
observable Markov decision process (POMDP) (Drake, 1962; Astrom, 1965; Sondik, 1971; 
Lovejoy, 1991b). The POMDP represents two sources of uncertainty: stochasticity of the 
underlying controlled process (e.g. disease dynamics in the patient management problem), 
and imperfect observability of its states via a set of noisy observations (e.g. symptoms, 
findings, results of tests). In addition, it lets us model in a uniform way both control and 
information-gathering (investigative) actions, as well as their effects and cost-benefit trade- 
offs. Partial observability and the ability to model and reason with information-gathering 
actions are the main features that distinguish the POMDP from the widely known fully 
observable Markov decision process (Bellman, 1957; Howard, 1960). 

Although useful from the modeling perspective, POMDPs have the disadvantage of being
hard to solve (Papadimitriou & Tsitsiklis, 1987; Littman, 1996; Mundhenk, Goldsmith,
Lusena, & Allender, 1997; Madani, Hanks, & Condon, 1999), and optimal or $\epsilon$-optimal solutions
can be obtained in practice only for problems of low complexity. A challenging goal in
this research area is to exploit additional structural properties of the domain and/or suitable 
approximations (heuristics) that can be used to obtain good solutions more efficiently. 

We focus here on heuristic approximation methods, in particular approximations based 
on value functions. Important research issues in this area are the design of new and efficient 
algorithms, as well as a better understanding of the existing techniques and their relations, 
advantages and disadvantages. In this paper we address both of these issues. First, we 
survey various value-function approximations, analyze their properties and relations and 
provide some insights into their differences. Second, we present a number of new methods 
and novel refinements of existing techniques. The theoretical results and findings are also 
supported empirically on a problem from the agent navigation domain. 

2. Partially Observable Markov Decision Processes 

A partially observable Markov decision process (POMDP) describes a stochastic control 
process with partially observable (hidden) states. Formally, it corresponds to a tuple
$(S, A, \Theta, T, O, R)$, where $S$ is a set of states, $A$ is a set of actions, $\Theta$ is a set of observations,
$T : S \times A \times S \to [0, 1]$ is a set of transition probabilities that describe the dynamic behavior
of the modeled environment, $O : S \times A \times \Theta \to [0, 1]$ is a set of observation probabilities that
describe the relationships among observations, states and actions, and $R : S \times A \times S \to \mathbb{R}$
denotes a reward model that assigns rewards to state transitions and models payoffs associated
with such transitions. In some instances the definition of a POMDP also includes an
a priori probability distribution over the set of initial states $S$.
Figure 1: Part of the influence diagram describing a POMDP model. Rectangles correspond
to decision nodes (actions), circles to random variables (states) and diamonds to
reward nodes. Links represent the dependencies among the components. $s_t$, $a_t$, $o_t$
and $r_t$ denote state, action, observation and reward at time $t$. Note that an action
at time $t$ depends only on past observations and actions, not on states.



2.1 Objective Function 

Given a POMDP, the goal is to construct a control policy that maximizes an objective (value) 
function. The objective function combines partial (stepwise) rewards over multiple steps 
using various kinds of decision models. Typically, the models are cumulative and based on 
expectations. Two models are frequently used in practice: 

• a finite-horizon model in which we maximize $E\left(\sum_{t=0}^{T} r_t\right)$, where $r_t$ is a reward obtained
at time $t$.

• an infinite-horizon discounted model in which we maximize $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$, where
$0 < \gamma < 1$ is a discount factor.

Note that POMDPs and cumulative decision models provide a rich language for modeling 
various control objectives. For example, one can easily model goal-achievement tasks (a
specific goal must be reached) by giving a large reward for a transition to that state and 
zero or smaller rewards for other transitions. 

In this paper we focus primarily on the discounted infinite-horizon model. However, the
results can also be applied easily to the finite-horizon case.
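As a small illustration of the two criteria, the sketch below computes the quantity inside each expectation for a single sampled reward sequence. The function names are hypothetical, and averaging over many sampled trajectories (not shown) would be needed to estimate the expectations themselves.

```python
def finite_horizon_return(rewards):
    """Undiscounted sum of the rewards r_0, ..., r_T of one trajectory."""
    return float(sum(rewards))


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a (truncated) reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# A goal-achievement task: zero reward until the goal transition at t = 3.
goal_trajectory = [0.0, 0.0, 0.0, 10.0]
print(finite_horizon_return(goal_trajectory))
print(discounted_return(goal_trajectory, gamma=0.9))
```

With discounting, the same goal reward is worth less the later it is reached, which is how the infinite-horizon criterion encodes a preference for reaching the goal quickly.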

2.2 Information State 

In a POMDP the process states are hidden and we cannot observe them while making a 
decision about the next action. Thus, our action choices are based only on the informa- 
tion available to us or on quantities derived from that information. This is illustrated in 
the influence diagram in Figure 1, where the action at time t depends only on previous 
observations and actions, not on states. Quantities summarizing all information are called 
information states. Complete information states represent a trivial case. 
Figure 2: Influence diagram for a POMDP with information states and corresponding
information-state MDP. Information states ($I_t$ and $I_{t+1}$) are represented by
double-circled nodes. An action choice (rectangle) depends only on the current
information state.



Definition 1 (Complete information state). The complete information state at time $t$ (denoted
$I_t^C$) consists of:

• a prior belief $b_0$ on states in $S$ at time 0;

• a complete history of actions and observations $\{o_0, a_0, o_1, a_1, \ldots, o_{t-1}, a_{t-1}, o_t\}$ starting
from time $t = 0$.

A sequence of information states defines a controlled Markov process that we call an
information-state Markov decision process or information-state MDP. The policy for the
information-state MDP is defined in terms of a control function $\mu : \mathcal{I} \to A$ mapping
the information state space to actions. The new information state ($I_t$) is a deterministic function
of the previous state ($I_{t-1}$), the last action ($a_{t-1}$) and the new observation ($o_t$):

$$I_t = \tau(I_{t-1}, o_t, a_{t-1}).$$

$\tau : \mathcal{I} \times \Theta \times A \to \mathcal{I}$ is the update function mapping the information state space, observations
and actions back to the information space.¹ It is easy to see that one can always convert
the original POMDP into the information-state MDP by using complete information states.
The relation between the components of the two models, and a sketch of a reduction of a
POMDP to an information-state MDP, are shown in Figure 2.

2.3 Bellman Equations for POMDPs 

An information-state MDP for the infinite-horizon discounted case is like a fully-observable 
MDP and satisfies the standard fixed-point (Bellman) equation: 

$$V^*(I) = \max_{a \in A} \Big\{ \rho(I, a) + \gamma \sum_{I'} P(I' \mid I, a) V^*(I') \Big\}. \tag{1}$$

1. In this paper, $\tau$ denotes the generic update function. Thus we use the same symbol even if the information
state space is different.

Here, $V^*(I)$ denotes the optimal value function maximizing $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$ for state $I$. $\rho(I, a)$
is the expected one-step reward and equals

$$\rho(I, a) = \sum_{s \in S} \rho(s, a) P(s \mid I) = \sum_{s \in S} \sum_{s' \in S} R(s, a, s') P(s' \mid s, a) P(s \mid I).$$

$\rho(s, a)$ denotes an expected one-step reward for state $s$ and action $a$.

Since the next information state $I' = \tau(I, o, a)$ is a deterministic function of the previous
information state $I$, action $a$, and the observation $o$, Equation 1 can be rewritten more
compactly by summing over all possible observations $\Theta$:

$$V^*(I) = \max_{a \in A} \Big\{ \sum_{s \in S} \rho(s, a) P(s \mid I) + \gamma \sum_{o \in \Theta} P(o \mid I, a) V^*(\tau(I, o, a)) \Big\}. \tag{2}$$

The optimal policy (control function) $\mu^* : \mathcal{I} \to A$ selects the value-maximizing action

$$\mu^*(I) = \arg\max_{a \in A} \Big\{ \sum_{s \in S} \rho(s, a) P(s \mid I) + \gamma \sum_{o \in \Theta} P(o \mid I, a) V^*(\tau(I, o, a)) \Big\}. \tag{3}$$

The value and control functions can be also expressed in terms of action-value functions
(Q-functions)

$$V^*(I) = \max_{a \in A} Q^*(I, a), \qquad \mu^*(I) = \arg\max_{a \in A} Q^*(I, a),$$

$$Q^*(I, a) = \sum_{s \in S} \rho(s, a) P(s \mid I) + \gamma \sum_{o \in \Theta} P(o \mid I, a) V^*(\tau(I, o, a)). \tag{4}$$

A Q-function corresponds to the expected reward for choosing a fixed action ($a$) in the first
step and acting optimally afterwards. 

2.3.1 Sufficient Statistics 

To derive Equations 1-3 we implicitly used complete information states. However, as
remarked earlier, the information available to the decision-maker can also be summarized
by other quantities. We call them sufficient information states. Such states must preserve 
the necessary information content and also the Markov property of the information-state 
decision process. 

Definition 2 (Sufficient information state process). Let $\mathcal{I}$ be an information state space
and $\tau : \mathcal{I} \times A \times \Theta \to \mathcal{I}$ be an update function defining an information process
$I_t = \tau(I_{t-1}, a_{t-1}, o_t)$. The process is sufficient with regard to the optimal control when, for any
time step $t$, it satisfies

$$P(s_t \mid I_t) = P(s_t \mid I_t^C),$$

$$P(o_t \mid I_{t-1}, a_{t-1}) = P(o_t \mid I_{t-1}^C, a_{t-1}),$$

where $I_t^C$ and $I_{t-1}^C$ are complete information states.

It is easy to see that Equations 1-3 for complete information states must hold also for
sufficient information states. The key benefit of sufficient statistics is that they are often
easier to manipulate and store, since unlike complete histories, they may not expand with
time. For example, in the standard POMDP model it is sufficient to work with belief states
that assign probabilities to every possible process state (Astrom, 1965).² In this case the
Bellman equation reduces to:

$$V(b) = \max_{a \in A} \Big\{ \sum_{s \in S} \rho(s, a) b(s) + \gamma \sum_{o \in \Theta} \sum_{s \in S} P(o \mid s, a) b(s) V(\tau(b, o, a)) \Big\}, \tag{5}$$

where the next-step belief state $b'$ is

$$b'(s) = \tau(b, o, a)(s) = \beta P(o \mid s, a) \sum_{s' \in S} P(s \mid a, s') b(s').$$

$\beta = 1 / P(o \mid b, a)$ is a normalizing constant. This defines a belief-state MDP which is a
special case of a continuous-state MDP. Belief-state MDPs are also the primary focus of 
our investigation in this paper. 
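The belief update $\tau(b, o, a)$ can be sketched in a few lines. The sketch below is illustrative only: the nested-list layout `T[a][s_prev][s_next]` for transition probabilities and `O[a][s_next][o]` for observation probabilities is an assumption made here, as are the function name and the toy two-state model.

```python
def belief_update(b, a, o, T, O):
    """Return (b_next, p_o) for belief b, action a and observation o.

    T[a][s_prev][s_next] = P(s_next | s_prev, a)   (assumed layout)
    O[a][s_next][o]      = P(o | s_next, a)        (assumed layout)
    p_o is P(o | b, a), the reciprocal of the normalizing constant.
    """
    n = len(b)
    unnormalized = []
    for s in range(n):
        # Prediction step: probability of landing in s after action a.
        predicted = sum(T[a][sp][s] * b[sp] for sp in range(n))
        # Correction step: weight by the likelihood of the observation.
        unnormalized.append(O[a][s][o] * predicted)
    p_o = sum(unnormalized)
    if p_o == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [u / p_o for u in unnormalized], p_o


# Toy model with two states, one action and two observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[[0.8, 0.2], [0.3, 0.7]]]
b_next, p_o = belief_update([0.5, 0.5], a=0, o=0, T=T, O=O)
print(b_next, p_o)
```

Returning $P(o \mid b, a)$ alongside the new belief is convenient because the same quantity weights the future value term in Equation 5.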

2.3.2 Value-Function Mappings and their Properties

The Bellman equation 2 for the belief-state MDP can be also rewritten in the value-function
mapping form. Let $\mathcal{V}$ be a space of real-valued bounded functions $V : \mathcal{I} \to \mathbb{R}$ defined on
the belief information space $\mathcal{I}$, and let $h : \mathcal{I} \times A \times \mathcal{V} \to \mathbb{R}$ be defined as

$$h(b, a, V) = \sum_{s \in S} \rho(s, a) b(s) + \gamma \sum_{o \in \Theta} \sum_{s \in S} P(o \mid s, a) b(s) V(\tau(b, o, a)).$$

Now by defining the value function mapping $H : \mathcal{V} \to \mathcal{V}$ as $(HV)(b) = \max_{a \in A} h(b, a, V)$,
the Bellman equation 2 for all information states can be written as $V^* = HV^*$. It is well
known that $H$ (for MDPs) is an isotone mapping and that it is a contraction under the
supremum norm (see Heyman & Sobel, 1984; Puterman, 1994).

Definition 3 The mapping $H$ is isotone if $V, U \in \mathcal{V}$ and $V \le U$ implies $HV \le HU$.

Definition 4 Let $\|\cdot\|$ be a supremum norm. The mapping $H$ is a contraction under the
supremum norm if, for all $V, U \in \mathcal{V}$, $\|HV - HU\| \le \beta \|V - U\|$ holds for some $0 \le \beta < 1$.

2.4 Value Iteration 

The optimal value function (Equation 2) or its approximation can be computed using dynamic
programming techniques. The simplest approach is the value iteration (Bellman,
1957) shown in Figure 3. In this case, the optimal value function $V^*$ can be determined
in the limit by performing a sequence of value-iteration steps $V_i = HV_{i-1}$, where $V_i$ is the
$i$th approximation of the value function ($i$th value function).³ The sequence of estimates

2. Models in which belief states are not sufficient include POMDPs with observation and action channel
lags (see Hauskrecht (1997)).

3. We note that the same update $V_i = HV_{i-1}$ can be applied to solve the finite-horizon problem in a
standard way. The difference is that $V_i$ now stands for the $i$-steps-to-go value function and $V_0$ represents
the value function (rewards) for end states.
Value iteration (POMDP, $\epsilon$)
    initialize $V$ for all $b \in \mathcal{I}$;
    repeat
        $V' \leftarrow V$;
        update $V \leftarrow HV'$ for all $b \in \mathcal{I}$;
    until $\sup_b |V(b) - V'(b)| \le \epsilon$
    return $V$;

Figure 3: Value iteration procedure.



converges to the unique fixed-point solution which is the direct consequence of Banach's 
theorem for contraction mappings (see, for example, Puterman (1994)). 

In practice, we stop the iteration well before it reaches the limit solution. The stopping 
criterion we use in our algorithm (Figure 3) examines the maximum difference between value 
functions obtained in two consecutive steps — the so-called Bellman error (Puterman, 1994; 
Littman, 1996). The algorithm stops when this quantity falls below the threshold $\epsilon$. The
accuracy of the approximate solution ($i$th value function) with regard to $V^*$ can be expressed
in terms of the Bellman error $\epsilon$.

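Theorem 1 Let $\epsilon = \sup_b |V_i(b) - V_{i-1}(b)| = \|V_i - V_{i-1}\|$ be the magnitude of the Bellman
error. Then $\|V_i - V^*\| \le \frac{\gamma \epsilon}{1 - \gamma}$ and $\|V_{i-1} - V^*\| \le \frac{\epsilon}{1 - \gamma}$ hold.

Then, to obtain the approximation of $V^*$ with precision $\delta$ the Bellman error should fall
below $\frac{\delta (1 - \gamma)}{\gamma}$.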
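The stopping rule can be illustrated on a problem where the update $V \leftarrow HV'$ is computable for every state, namely a small fully observable MDP with finitely many states. This is a simplifying assumption made for the sketch; over the continuous belief space the update requires the representations discussed next. The array layout `P[a][s][s2]`, `R[a][s]` and the function name are hypothetical.

```python
def value_iteration(P, R, gamma, delta):
    """Value iteration with a Bellman-error stopping criterion.

    P[a][s][s2] = transition probability, R[a][s] = expected one-step reward.
    Iterates until the Bellman error is at most delta * (1 - gamma) / gamma,
    the threshold that corresponds to a target precision delta.
    """
    n_actions, n_states = len(R), len(R[0])
    eps = delta * (1.0 - gamma) / gamma
    V = [0.0] * n_states
    while True:
        V_prev = V
        # One application of the mapping H to the previous estimate.
        V = [
            max(
                R[a][s] + gamma * sum(P[a][s][s2] * V_prev[s2]
                                      for s2 in range(n_states))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
        bellman_error = max(abs(V[s] - V_prev[s]) for s in range(n_states))
        if bellman_error <= eps:
            return V, bellman_error


# Two states, two actions; action 1 moves toward the rewarding state 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 1.0], [0.0, 1.0]]
V, err = value_iteration(P, R, gamma=0.9, delta=0.01)
print(V, err)
```

The threshold is computed once from the desired precision, so the caller states the accuracy wanted for the value function rather than a raw Bellman-error tolerance.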

2.4.1 Piecewise Linear and Convex Approximations of the Value Function

The major difficulty in applying the value iteration (or dynamic programming) to belief-state
MDPs is that the belief space is infinite and we need to compute an update $V_i = HV_{i-1}$
for all of it. This poses the following threats: the value function for the $i$th step may not
be representable by finite means and/or computable in a finite number of steps.

To address this problem Sondik (Sondik, 1971; Smallwood & Sondik, 1973) showed that
one can guarantee the computability of the $i$th value function as well as its finite description
for a belief-state MDP by considering only piecewise linear and convex representations of
value function estimates (see Figure 4). In particular, Sondik showed that for a piecewise
linear and convex representation of $V_{i-1}$, $V_i = HV_{i-1}$ is computable and remains piecewise
linear and convex.

Theorem 2 (Piecewise linear and convex functions). Let $V_0$ be an initial value function
that is piecewise linear and convex. Then the $i$th value function obtained after a finite
number of update steps for a belief-state MDP is also finite, piecewise linear and convex,
and is equal to:

$$V_i(b) = \max_{\alpha_i \in \Gamma_i} \sum_{s \in S} b(s) \alpha_i(s),$$

where $b$ and $\alpha_i$ are vectors of size $|S|$ and $\Gamma_i$ is a finite set of vectors (linear functions) $\alpha_i$.
Figure 4: A piecewise linear and convex function for a POMDP with two process states
$\{s_1, s_2\}$. Note that $b(s_1) = 1 - b(s_2)$ holds for any belief state.



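The key part of the proof is that we can express the update for the $i$th value function
in terms of linear functions $\Gamma_{i-1}$ defining $V_{i-1}$:

$$V_i(b) = \max_{a \in A} \Big\{ \sum_{s \in S} \rho(s, a) b(s) + \gamma \sum_{o \in \Theta} \max_{\alpha_{i-1} \in \Gamma_{i-1}} \sum_{s' \in S} \Big[ \sum_{s \in S} P(s', o \mid s, a) b(s) \Big] \alpha_{i-1}(s') \Big\}. \tag{6}$$

This leads to a piecewise linear and convex value function $V_i$ that can be represented by
a finite set of linear functions $\alpha_i$, one linear function for every combination of actions and
permutations of $\alpha_{i-1}$ vectors of size $|\Theta|$. Let $W = (a, \{o_1, \alpha_{i-1}^{j_1}\}, \{o_2, \alpha_{i-1}^{j_2}\}, \ldots, \{o_{|\Theta|}, \alpha_{i-1}^{j_{|\Theta|}}\})$
be such a combination. Then the linear function corresponding to it is defined as

$$\alpha_i^W(s) = \rho(s, a) + \gamma \sum_{o \in \Theta} \sum_{s' \in S} P(s', o \mid s, a) \alpha_{i-1}^{j_o}(s'). \tag{7}$$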

Theorem 2 is the basis of the dynamic programming algorithm for finding the optimal 
solution for the finite-horizon models and the value-iteration algorithm for finding near- 
optimal approximations of V* for the discounted, infinite-horizon model. Note, however, 
that this result does not imply piecewise linearity of the optimal (fixed-point) solution V*. 
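To make the construction in Equation 7 concrete, the following sketch enumerates every candidate linear function for one update step. It is a brute-force illustration rather than any of the published algorithms, and it assumes that $P(s', o \mid s, a)$ factors as `T[a][s][s2] * O[a][s2][o]`; the data layout and the names `exact_update` and `pwlc_value` are hypothetical.

```python
from itertools import product


def exact_update(gamma_prev, T, O, rho, gamma):
    """Enumerate all (action, alpha) candidates for one update step.

    gamma_prev : list of alpha vectors defining the previous value function
    T[a][s][s2] = P(s2 | s, a), O[a][s2][o] = P(o | s2, a)  (assumed layout)
    rho[a][s]   = expected one-step reward
    """
    n_actions, n_states = len(rho), len(rho[0])
    n_obs = len(O[0][0])
    candidates = []
    for a in range(n_actions):
        # Choose one previous-step vector independently for each observation.
        for choice in product(range(len(gamma_prev)), repeat=n_obs):
            alpha = []
            for s in range(n_states):
                future = sum(
                    T[a][s][s2] * O[a][s2][o] * gamma_prev[choice[o]][s2]
                    for o in range(n_obs)
                    for s2 in range(n_states)
                )
                alpha.append(rho[a][s] + gamma * future)
            candidates.append((a, alpha))
    return candidates


def pwlc_value(b, candidates):
    """Value of belief b as the maximum over the linear functions."""
    return max(sum(x * y for x, y in zip(b, alpha)) for _, alpha in candidates)


T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.5, 0.5]]]
rho = [[1.0, 0.0], [0.0, 1.0]]
step1 = exact_update([[0.0, 0.0]], T, O, rho, gamma=0.9)
step2 = exact_update([alpha for _, alpha in step1], T, O, rho, gamma=0.9)
print(len(step1), len(step2))
```

Keeping the action next to each vector costs nothing during enumeration and is what later makes it possible to read a control action directly off the maximizing vector.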



2.4.2 Algorithms for Computing Value-Function Updates 



The key part of the value-iteration algorithm is the computation of value-function updates
$V_i = HV_{i-1}$. Assume that the value function $V_{i-1}$ is represented by a finite number of linear
segments ($\alpha$ vectors). The total number of all possible linear functions for $V_i$ is $|A||\Gamma_{i-1}|^{|\Theta|}$ (one
for every combination of actions and permutations of $\alpha_{i-1}$ vectors of size $|\Theta|$) and they can
be enumerated in $O(|A||S|^2|\Gamma_{i-1}|^{|\Theta|})$ time. However, the complete set of linear functions
is rarely needed: some of the linear functions are dominated by others and their omission
does not change the resulting piecewise linear and convex function. This is illustrated in
Figure 5.
Figure 5: Redundant linear function. The function does not dominate in any of the regions
of the belief space and can be excluded.



A linear function that can be eliminated without changing the resulting value function 
solution is called redundant. Conversely, a linear function that singlehandedly achieves the 
optimal value for at least one point of the belief space is called useful.⁴
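A cheap partial test for redundancy is pointwise dominance: a vector that is no larger than some other single vector in every component can never be the maximizer. The sketch below implements only this sufficient check and is not a complete redundancy test; a vector may be redundant because it lies below the upper surface formed by several vectors jointly, which is usually detected with a linear program (not shown). The function name is hypothetical.

```python
def prune_pointwise_dominated(vectors):
    """Drop vectors that are pointwise dominated by another single vector.

    Exact duplicates are reduced to their first occurrence.
    """
    kept = []
    for i, v in enumerate(vectors):
        dominated = False
        for j, w in enumerate(vectors):
            if i == j:
                continue
            at_least_as_good = all(wk >= vk for wk, vk in zip(w, v))
            # For equal vectors, only the later copy counts as dominated.
            if at_least_as_good and (w != v or j < i):
                dominated = True
                break
        if not dominated:
            kept.append(v)
    return kept


vectors = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [0.6, 0.6], [1.0, 0.0]]
print(prune_pointwise_dominated(vectors))
```

Because the check compares vectors only pairwise, it runs in time quadratic in the number of vectors and can serve as a filter before a more expensive exact test.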

For the sake of computational efficiency it is important to make the size of the linear 
function set as small as possible (keep only useful linear functions) over value-iteration steps.
There are two main approaches for computing useful linear functions. The first approach is 
based on a generate-and-test paradigm and is due to Sondik (1971) and Monahan (1982). 
The idea here is to enumerate all possible linear functions first, then test the usefulness 
of linear functions in the set and prune all redundant vectors. Recent extensions of the 
method interleave the generate and test stages and do early pruning on a set of partially 
constructed linear functions (Zhang & Liu, 1997a; Cassandra, Littman, & Zhang, 1997; 
Zhang & Lee, 1998). 

The second approach builds on Sondik's idea of computing a useful linear function for a 
single belief state (Sondik, 1971; Smallwood & Sondik, 1973), which can be done efficiently. 
The key problem here is to locate all belief points that seed useful linear functions and 
different methods address this problem differently. Methods that implement this idea are 
Sondik's one- and two-pass algorithms (Sondik, 1971), Cheng's methods (Cheng, 1988), and 
the Witness algorithm (Kaelbling, Littman, & Cassandra, 1999; Littman, 1996; Cassandra, 
1998). 

2.4.3 Limitations and Complexity 

The major difficulty in solving a belief-state MDP is that the complexity of a piecewise 
linear and convex function can grow extremely fast with the number of update steps. More 
specifically, the size of a linear function set defining the function can grow exponentially (in 
the number of observations) during a single update step. Then, assuming that the initial 
value function is linear, the number of linear functions defining the $i$th value function is
$O(|A|^{|\Theta|^{i-1}})$.

4. In defining redundant and useful linear functions we assume that there are no linear function duplicates, 
i.e. only one copy of the same linear function is kept in the set $\Gamma_i$.
The potential growth of the size of the linear function set is not the only bad news. As
remarked earlier, a piecewise linear convex value function is usually less complex than the 
worst case because many linear functions can be pruned away during updates. However, 
it turned out that the task of identifying all useful linear functions is computationally 
intractable as well (Littman, 1996). This means that one faces not only the potential 
super-exponential growth of the number of useful linear functions, but also inefficiencies 
related to the identification of such vectors. This is a significant drawback that makes the 
exact methods applicable only to relatively simple problems. 

The above analysis suggests that solving a POMDP problem is an intrinsically hard 
task. Indeed, finding the optimal solution for the finite-horizon problem is PSPACE-hard 
(Papadimitriou & Tsitsiklis, 1987). Finding the optimal solution for the discounted infinite-horizon
criterion is even harder. The corresponding decision problem has been shown to be
undecidable (Madani et al., 1999), and thus the optimal solution may not be computable. 

2.4.4 Structural Refinements of the Basic Algorithm 

The standard POMDP model uses a flat state space and full transition and reward matrices. 
However, in practice, problems often exhibit more structure and can be represented more 
compactly, for example, using graphical models (Pearl, 1988; Lauritzen, 1996), most often 
dynamic belief networks (Dean & Kanazawa, 1989; Kjaerulff, 1992) or dynamic influence
diagrams (Howard & Matheson, 1984; Tatman & Shachter, 1990).⁵ There are many ways
to take advantage of the problem structure to modify and improve exact algorithms. For 
example, a refinement of the basic Monahan algorithm to compact transition and reward 
models has been studied by Boutilier and Poole (1996). A hybrid framework that combines 
MDP-POMDP problem-solving techniques to take advantage of perfectly and partially ob- 
servable components of the model and the subsequent value function decomposition was 
proposed by Hauskrecht (1997, 1998, 2000). A similar approach with perfect information 
about a region (subset of states) containing the actual underlying state was discussed by 
Zhang and Liu (1997b, 1997a). Finally, Castañon (1997) and Yost (1998) explore techniques
for solving large POMDPs that consist of a set of smaller, resource-coupled but otherwise
independent POMDPs. 

2.5 Extracting Control Strategy 

Value iteration allows us to compute an $i$th approximation of the value function $V_i$. However,
our ultimate goal is to find the optimal control strategy $\mu^* : \mathcal{I} \to A$ or its close approximation.
Thus our focus here is on the problem of extraction of control strategies from the results of
value iteration.

2.5.1 Lookahead Design

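The simplest way to define the control function $\mu : \mathcal{I} \to A$ from the value function $V_i$ is via
greedy one-step lookahead:

$$\mu(b) = \arg\max_{a \in A} \Big\{ \sum_{s \in S} \rho(s, a) b(s) + \gamma \sum_{o \in \Theta} \sum_{s \in S} P(o \mid s, a) b(s) V_i(\tau(b, o, a)) \Big\}.$$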


5. See the survey by Boutilier, Dean and Hanks (1999) for different ways to represent structured MDPs. 
Figure 6: Direct control design. Every linear function defining $V_i$ is associated with an
action. The action is selected if its linear function (or Q-function) is maximal.



As $V_i$ represents only the $i$th approximation of the optimal value function, the question
arises how good the resulting controller really is.⁶ The following theorem (Puterman, 1994;
Williams & Baird, 1994; Littman, 1996) relates the accuracy of the (lookahead) controller
and the Bellman error.

Theorem 3 Let $\epsilon = \|V_i - V_{i-1}\|$ be the magnitude of the Bellman error. Let $V_i^{LA}$ be the
expected reward for the lookahead controller designed for $V_i$. Then $\|V_i^{LA} - V^*\| \le \frac{2 \epsilon \gamma}{1 - \gamma}$.

The bound can be used to construct the value-iteration routine that yields a lookahead
strategy with a minimum required precision. The result can also be extended to the $k$-step
lookahead design in a straightforward way; with $k$ steps, the error bound becomes
$\|V_i^{LA,k} - V^*\| \le \frac{2 \epsilon \gamma^k}{1 - \gamma}$.
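A greedy one-step lookahead over a piecewise linear and convex $V_i$ can be sketched as follows. The sketch assumes the same hypothetical nested-list model layout as a factored $P(s', o \mid s, a)$ = `T[a][s][s2] * O[a][s2][o]`, and it uses the fact that $P(o \mid b, a) V_i(\tau(b, o, a))$ equals the maximum over vectors applied to the unnormalized next belief, so no explicit normalization is needed.

```python
def lookahead_action(b, gamma_vectors, T, O, rho, gamma):
    """Return (best_action, best_q) by one-step lookahead on belief b.

    gamma_vectors : alpha vectors defining the value function V_i
    T[a][s][s2] = P(s2 | s, a), O[a][s2][o] = P(o | s2, a)  (assumed layout)
    rho[a][s]   = expected one-step reward
    """
    n_states, n_obs = len(b), len(O[0][0])
    best_a, best_q = None, float("-inf")
    for a in range(len(rho)):
        q = sum(rho[a][s] * b[s] for s in range(n_states))
        for o in range(n_obs):
            # Unnormalized next belief; its components sum to P(o | b, a).
            nb = [
                O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n_states))
                for s2 in range(n_states)
            ]
            # The maximum of linear functions scales with its argument.
            q += gamma * max(
                sum(x * y for x, y in zip(nb, alpha)) for alpha in gamma_vectors
            )
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q


T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.5, 0.5]]]
rho = [[1.0, 0.0], [0.0, 1.0]]
print(lookahead_action([0.8, 0.2], [[0.0, 0.0]], T, O, rho, gamma=0.9))
```

The cost per decision is one pass over all actions, observations and vectors, which is the reaction-time overhead that motivates the cheaper extraction schemes.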

2.5.2 Direct Design 

To extract the control action via lookahead essentially requires computing one full update. 
Obviously, this can lead to unwanted delays in reaction times. In general, we can speed up 
the response by remembering and using additional information. In particular, every linear 
function defining Vi is associated with the choice of action (see Equation 7). The action is a 
byproduct of methods for computing linear functions and no extra computation is required 
to find it. Then the action corresponding to the best linear function can be selected directly 
for any belief state. The idea is illustrated in Figure 6. 
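The direct design amounts to a single maximization over tagged vectors. In the minimal sketch below, each linear function is stored together with its action as an `(action, alpha)` pair; this pairing, the function name and the example vectors are illustrative assumptions.

```python
def direct_action(b, tagged_vectors):
    """Return the action attached to the linear function that is maximal at b.

    tagged_vectors : list of (action, alpha) pairs defining V_i
    """
    best_action, _ = max(
        tagged_vectors,
        key=lambda pair: sum(x * y for x, y in zip(b, pair[1])),
    )
    return best_action


# Hypothetical vectors for a two-state problem with three actions.
tagged = [("left", [1.0, 0.0]), ("right", [0.0, 1.0]), ("listen", [0.6, 0.6])]
for belief in ([0.9, 0.1], [0.5, 0.5], [0.1, 0.9]):
    print(belief, direct_action(belief, tagged))
```

Compared with lookahead, no model quantities are needed at decision time, only the current belief state and the stored vectors.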

The bound on the accuracy of the direct controller for the infinite-horizon case can be 
once again derived in terms of the magnitude of the Bellman error. 

Theorem 4 Let $\epsilon = \|V_i - V_{i-1}\|$ be the magnitude of the Bellman error. Let $V_i^{DR}$ be an
expected reward for the direct controller designed for $V_i$. Then $\|V_i^{DR} - V^*\| \le \frac{2 \epsilon}{1 - \gamma}$.

The direct action choice is closely related to the notion of action-value function (or
Q-function). Analogously to Equation 4, the $i$th Q-function satisfies

$$V_i(b) = \max_{a \in A} Q_i(b, a),$$

6. Note that the control action extracted via lookahead from $V_i$ is optimal for $(i + 1)$ steps-to-go and the
finite-horizon model. The main difference here is that $V_i$ is the optimal value function for $i$ steps to go.
Figure 7: A policy graph (finite-state machine) obtained after two value iteration steps.
Nodes correspond to linear functions (or states of the finite-state machine) and
links to dependencies between linear functions (transitions between states). Every
linear function (node) is associated with an action. To ensure that the policy can
also be applied to the infinite-horizon problem, we add a cycle to the last state
(dashed line).



$$Q_i(b, a) = \rho(b, a) + \gamma \sum_{o \in \Theta} P(o \mid b, a) V_{i-1}(\tau(b, o, a)).$$

From this perspective, the direct strategy selects the action with the best (maximum) Q-function
for a given belief state.

2.5.3 Finite-State Machine Design 

A more complex refinement of the above technique is to remember, for every linear function 
in $\Gamma_i$, not only the action choice but also the choice of a linear function for the previous
step and to do this for all observations (see Equation 7). As the same idea can be applied
recursively to the linear functions for all previous steps, we can obtain a relatively complex
dependency structure relating linear functions in $V_i, V_{i-1}, \ldots, V_0$, observations and actions
that itself represents a control strategy (Kaelbling et al., 1999). 

To see this, we model the structure in graphical terms (Figure 7). Here different nodes 
represent linear functions, actions associated with nodes correspond to optimizing actions, 
links emanating from nodes correspond to different observations, and successor nodes corre- 
spond to linear functions paired with observations. Such graphs are also called policy graphs 
(Kaelbling et al., 1999; Littman, 1996; Cassandra, 1998). One interpretation of the depen- 
dency structure is that it represents a collection of finite-state machines (FSMs) with many 
possible initial states that implement a POMDP controller: nodes correspond to states of 
the controller, actions to controls (outputs), and links to transitions conditioned on inputs 

7. Williams and Baird (1994) also give results relating the accuracy of the direct Q-function controller to 
the Bellman error of Q-functions. 
(observations). The start state of the FSM controller is chosen greedily by selecting the
linear function (controller state) optimizing the value of an initial belief state. 

The advantage of the finite-state machine representation of the strategy is that for the 
first i steps it works with observations directly; belief-state updates are not needed. This 
contrasts with the other two policy models (lookahead and direct models), which must keep 
track of the current belief state and update it over time in order to extract appropriate 
control. The drawback of the approach is that the FSM controller is limited to i steps 
that correspond to the number of value iteration steps performed. However, in the infinite- 
horizon model the controller is expected to run for an infinite number of steps. One way 
to remedy this deficiency is to extend the FSM structure and to create cycles that let us 
visit controller states repeatedly. For example, adding a cycle transition to the end state of 
the FSM controller in Figure 7 (dashed line) ensures that the controller is also applicable 
to the infinite-horizon problem. 

2.6 Policy Iteration 

An alternative method for finding the solution for the discounted infinite-horizon problem 
is policy iteration (Howard, 1960; Sondik, 1978). Policy iteration searches the policy space 
and gradually improves the current control policy for one or more belief states. The method 
consists of two steps performed iteratively: 

• policy evaluation: computes expected value for the current policy; 

• policy improvement: improves the current policy. 

As we saw in Section 2.5, there are many ways to represent a control policy for a 
POMDP. Here we restrict attention to a finite-state machine model in which observations 
correspond to inputs and actions to outputs (Platzman, 1980; Hansen, 1998b; Kaelbling 
et al., 1999).⁸

2.6.1 Finite-State Machine Controller 

A finite-state machine (FSM) controller $C = (M, \Theta, A, \phi, \eta, \psi)$ for a POMDP is described 
by a set of memory states $M$ of the controller, a set of observations (inputs) $\Theta$, a set of 
actions (outputs) $A$, a transition function $\phi : M \times \Theta \to M$ mapping states of the FSM to 
next memory states given the observation, and an output function $\eta : M \to A$ mapping 
memory states to actions. A function $\psi : I_0 \to M$ selects the initial memory state given 
the initial information state. The initial information state corresponds either to a prior or 
a posterior belief state at time $t_0$ depending on the availability of an initial observation. 
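
The controller tuple can be sketched as a small data structure. The following is an illustrative sketch only: the class name, the field names and the toy "sense"/"move" controller are hypothetical, and the selection function $\psi$ is replaced by passing the initial memory state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FSMController:
    """Deterministic finite-state controller (illustrative names)."""
    eta: Dict[int, str]               # output function: memory state -> action
    phi: Dict[Tuple[int, str], int]   # transition: (memory state, observation) -> next memory state

    def run(self, x0: int, observations: List[str]) -> List[str]:
        """Return the actions emitted while reading a sequence of observations."""
        x, actions = x0, []
        for o in observations:
            actions.append(self.eta[x])   # act according to the current memory state
            x = self.phi[(x, o)]          # then move to the next memory state
        return actions


# Hypothetical two-state controller: keep sensing until a 'clear' reading, then move.
ctrl = FSMController(
    eta={0: "sense", 1: "move"},
    phi={(0, "wall"): 0, (0, "clear"): 1, (1, "wall"): 0, (1, "clear"): 1},
)
```

Note that the controller never consults a belief state while running; all information about the past is carried by the current memory state.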

2.6.2 Policy Evaluation 

The first step of the policy iteration is policy evaluation. The most important property 
of the FSM model is that the value function for a specific FSM strategy can be computed 
efficiently in the number of controller states $|M|$. The key to efficient computability is the 

8. A policy-iteration algorithm in which policies are defined over the regions of the belief space was described 
first by Sondik (1978). 
Figure 8: An example of a four-state FSM policy. Nodes represent states, links 
transitions between states (conditioned on observations). Every memory state has an 
associated control action (output). 



fact that the value function for executing an FSM strategy from some memory state x is 
linear (Platzman, 1980).^9 

Theorem 5 Let $C$ be a finite-state machine controller with a set of memory states $M$. 
The value function for applying $C$ from a memory state $x \in M$, $V^C(x, b)$, is linear. Value 
functions for all $x \in M$ can be found by solving a system of linear equations with $|S||M|$ 
variables. 

We illustrate the main idea by an example. Assume an FSM controller with four memory 
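states $\{x_1, x_2, x_3, x_4\}$, as in Figure 8, and a stochastic process with two hidden states $S = 
\{s_1, s_2\}$. The value of the policy for an augmented state space $S \times M$ satisfies a system of 
linear equations 

$$
\begin{aligned}
V(x_1, s_1) &= \rho(s_1, \eta(x_1)) + \gamma \sum_{o \in \Theta} \sum_{s \in S} P(o, s \mid s_1, \eta(x_1))\, V(\phi(x_1, o), s) \\
V(x_1, s_2) &= \rho(s_2, \eta(x_1)) + \gamma \sum_{o \in \Theta} \sum_{s \in S} P(o, s \mid s_2, \eta(x_1))\, V(\phi(x_1, o), s) \\
V(x_2, s_1) &= \rho(s_1, \eta(x_2)) + \gamma \sum_{o \in \Theta} \sum_{s \in S} P(o, s \mid s_1, \eta(x_2))\, V(\phi(x_2, o), s) \\
&\;\;\vdots \\
V(x_4, s_2) &= \rho(s_2, \eta(x_4)) + \gamma \sum_{o \in \Theta} \sum_{s \in S} P(o, s \mid s_2, \eta(x_4))\, V(\phi(x_4, o), s),
\end{aligned}
$$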

where $\eta(x)$ is the action executed in $x$ and $\phi(x, o)$ is the state to which one transits after 
seeing an input (observation) $o$. Assuming we start the policy from the memory state $x_1$, 
the value of the policy is: 

$$V^C(x_1, b) = \sum_{s \in S} V(x_1, s)\, b(s).$$

9. The idea of linearity and efficient computability of the value functions for a fixed FSM-based strategy 
has been addressed recently in different contexts by a number of researchers (Littman, 1996; Cassandra, 
1998; Hauskrecht, 1997; Hansen, 1998b; Kaelbling et al., 1999). However, the origins of the idea can be 
traced to the earlier work by Platzman (1980). 
Thus the value function is linear and can be computed efficiently by solving a system of 
linear equations. 

Since in general the FSM controller can start from any memory state, we can always 
choose the initial memory state greedily, maximizing the expected value of the result. In 
such a case the optimal choice function $\psi$ is defined as: 

$$\psi(b) = \arg\max_{x \in M} V^C(x, b),$$

and the value for the FSM policy $C$ and belief state $b$ is: 

$$V^C(b) = \max_{x \in M} V^C(x, b) = V^C(\psi(b), b).$$

Note that the resulting value function for the strategy $C$ is piecewise linear and convex and 
represents expected rewards for following $C$. Since no strategy can perform better than the 
optimal strategy, $V^C \le V^*$ must hold. 
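
The evaluation step of this section can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the implementation used in the paper: the array layouts, the function names, and the factorization $P(o, s' \mid s, a) = P(s' \mid s, a)\,P(o \mid s', a)$ are all assumptions made for the illustration.

```python
import numpy as np


def evaluate_fsm(R, T, Z, eta, phi, gamma):
    """Solve the |S||M| linear equations for V[x, s].

    R[s, a]      immediate reward rho(s, a)
    T[a, s, s2]  P(s2 | s, a)
    Z[a, s2, o]  P(o | s2, a)  (assumed factorization of P(o, s2 | s, a))
    eta[x]       action emitted in memory state x
    phi[x, o]    next memory state after observing o in memory state x
    """
    n_s = R.shape[0]
    n_m, n_o = phi.shape
    n = n_m * n_s
    A = np.eye(n)        # coefficient matrix of (I - gamma * P) V = c
    c = np.zeros(n)
    for x in range(n_m):
        a = eta[x]
        for s in range(n_s):
            row = x * n_s + s
            c[row] = R[s, a]
            for o in range(n_o):
                for s2 in range(n_s):
                    col = phi[x, o] * n_s + s2
                    A[row, col] -= gamma * T[a, s, s2] * Z[a, s2, o]
    return np.linalg.solve(A, c).reshape(n_m, n_s)


def controller_value(V, b):
    """Greedy choice of the initial memory state: returns (psi(b), V^C(b))."""
    vals = V @ b                  # vals[x] = sum_s V[x, s] b(s)
    x = int(np.argmax(vals))
    return x, float(vals[x])
```

Each row of the returned array is one linear function over the belief space, so taking the maximum over rows yields the piecewise linear and convex value function of the controller.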

2.6.3 Policy Improvement 

The policy-iteration method, searching the space of controllers, starts from an arbitrary ini- 
tial policy and improves it gradually by refining its finite-state machine (FSM) description. 
In particular, one keeps modifying the structure of the controller by adding or removing 
controller states (memory) and transitions. Let $C$ and $C'$ be an old and a new FSM controller. 
In the improvement step we must satisfy 

$$V^{C'}(b) \ge V^{C}(b) \ \text{ for all } b \in I;$$

$$\exists\, b \in I \ \text{ such that } \ V^{C'}(b) > V^{C}(b).$$

To guarantee the improvement, Hansen (1998a, 1998b) proposed a policy-iteration algo- 
rithm that relies on exact value function updates to obtain a new improved policy struc- 
ture. The basic idea of the improvement is based on the observation that one can switch 
back and forth between the FSM policy description and the piecewise-linear and convex 
representation of a value function. In particular: 

• the value function for an FSM policy is piecewise-linear and convex and every linear 
function describing it corresponds to a memory state of a controller; 

• individual linear functions comprising the new value function after an update can be 
viewed as new memory states of an FSM policy, as described in Section 2.5.3. 

This allows us to improve the policy by adding new memory states corresponding to linear 
functions of the new value function obtained after the exact update. The technique can be 
refined by removing some of the linear functions (memory states) whenever they are fully 
dominated by one of the other linear functions. 
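
The refinement in the last sentence can be sketched as a pointwise-dominance test on the linear functions. This is an illustrative sketch only: the function name and the array layout are assumptions, and it covers only dominance by a single other vector, as described above, not pruning based on linear programming.

```python
import numpy as np


def prune_pointwise_dominated(alphas):
    """Return indices of vectors that are not dominated by another single vector.

    alphas: array of shape (k, |S|); row j holds the linear function alpha_j(s).
    """
    alphas = np.asarray(alphas, dtype=float)
    keep = []
    for j, a in enumerate(alphas):
        dominated = False
        for k, other in enumerate(alphas):
            if k == j:
                continue
            # 'other' dominates 'a' if it is at least as large in every state;
            # exact duplicates are resolved by keeping the lower index.
            if np.all(other >= a) and (np.any(other > a) or k < j):
                dominated = True
                break
        if not dominated:
            keep.append(j)
    return keep
```

In a controller, each removed vector corresponds to a memory state that is never the best starting point for any belief state.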

10. A policy-iteration algorithm that exploits exact value function updates but works with policies defined 
over the belief space was used earlier by Sondik (1978). 
Figure 9: A two-step decision tree. Rectangles correspond to the decision nodes (moves 
of the decision-maker) and circles to chance nodes (moves of the environment). 
Black rectangles represent leaves of the tree. The reward for a specific path 
is associated with every leaf of the tree. Decision nodes are associated with 
information states obtained by following action and observation choices along the 
path from the root of the tree. For example, $b_{1,1}$ is a belief state obtained by 
performing action $a_1$ from the initial belief state $b$ and observing observation $o_1$. 



2.7 Forward (Decision Tree) Methods 

The methods discussed so far assume no prior knowledge of the initial belief state and treat 
all belief states as equally likely. However, if the initial state is known and fixed, methods 
can often be modified to take advantage of this fact. For example, for the finite-horizon 
problem, only a finite number of belief states can be reached from a given initial state. In 
this case it is very often easier to enumerate all possible histories (sequences of actions and 
observations) and represent the problem using stochastic decision trees (Raiffa, 1970). An 
example of a two-step decision tree is shown in Figure 9. 

The algorithm for solving the stochastic decision tree basically mimics value-function 
updates, but is restricted only to situations that can be reached from the initial belief state. 
The key difficulty here is that the number of all possible trajectories grows exponentially 
with the horizon of interest. 
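
A forward expansion of this kind can be sketched as a recursion over belief states. The sketch below assumes models stored as arrays `T[a, s, s2]` and `Z[a, s2, o]` and zero value at the leaves; these names and conventions are illustrative assumptions, not the paper's notation.

```python
import numpy as np


def belief_update(b, a, o, T, Z):
    """Return (P(o | b, a), next belief) for T[a, s, s2] and Z[a, s2, o]."""
    unnorm = Z[a, :, o] * (b @ T[a])       # joint P(o, s2 | b, a)
    p_o = unnorm.sum()
    return p_o, (unnorm / p_o if p_o > 0 else unnorm)


def tree_value(b, depth, R, T, Z, gamma):
    """Expand the decision tree rooted at belief b to the given depth.

    Returns (value, best first action); leaves are assumed to be worth 0.
    """
    if depth == 0:
        return 0.0, None
    best_v, best_a = -np.inf, None
    for a in range(R.shape[1]):            # decision node: one branch per action
        v = float(b @ R[:, a])
        for o in range(Z.shape[2]):        # chance node: one branch per observation
            p_o, b_next = belief_update(b, a, o, T, Z)
            if p_o > 0:
                v += gamma * p_o * tree_value(b_next, depth - 1, R, T, Z, gamma)[0]
        if v > best_v:
            best_v, best_a = v, a
    return best_v, best_a
```

The recursion visits on the order of $(|A||\Theta|)^{d}$ nodes for depth $d$, which is the exponential growth noted above.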

2.7.1 Combining Dynamic-Programming and Decision-Tree Techniques 

To solve a POMDP for a fixed initial belief state, we can apply two strategies: one con- 
structs the decision tree first and then solves it, the other solves the problem in a backward 
fashion via dynamic programming. Unfortunately, both these techniques are inefficient, one 
suffering from exponential growth in the decision tree size, the other from super-exponential 
growth in the value function complexity. However, the two techniques can be combined in 
a way that at least partially eliminates their disadvantages. The idea is based on the fact 
that the two techniques work on the solution from two different sides (one forward and the 
other backward) and the complexity for each of them worsens gradually. Then the solution 
is to compute the complete $k$th value function using dynamic programming (value iteration) 
and cover the remaining steps by forward decision-tree expansion. 

Various modifications of the above idea are possible. For example, one can often replace 
exact dynamic programming with two more efficient approximations providing upper and 
lower bounds of the value function. Then the decision tree must be expanded only when 
the bounds are not sufficient to determine the optimal action choice. A number of search 
techniques developed in the AI literature (Korf, 1985) combined with branch-and-bound 
pruning (Satia & Lave, 1973) can be applied to this type of problem. Several researchers 
have experimented with them to solve POMDPs (Washington, 1996; Hauskrecht, 1997; 
Hansen, 1998b). Other methods applicable to this problem are based on Monte-Carlo 
sampling (Kearns, Mansour, & Ng, 1999; McAllester & Singh, 1999) and real-time dynamic 
programming (Barto, Bradtke, & Singh, 1995; Dearden & Boutilier, 1997; Bonet & Geffner, 
1998). 

2.7.2 Classical Planning Framework 

POMDP problems with fixed initial belief states and their solutions are closely related to 
work in classical planning and its extensions to handle stochastic and partially observable 
domains, particularly the work on BURIDAN and C-BURIDAN planners (Kushmerick, 
Hanks, & Weld, 1995; Draper, Hanks, & Weld, 1994). The objective of these planners is 
to maximize the probability of reaching some goal state. However, this task is similar to 
the discounted reward task in terms of complexity, since a discounted reward model can 
be converted into a goal-achievement model by introducing an absorbing state (Condon, 
1992). 

3. Heuristic Approximations 

The key obstacle to wider application of the POMDP framework is the computational 
complexity of POMDP problems. In particular, finding the optimal solution for the 
finite-horizon case is PSPACE-hard (Papadimitriou & Tsitsiklis, 1987) and the discounted 
infinite-horizon case may not even be computable (Madani et al., 1999). One approach to such 
problems is to approximate the solution to some $\epsilon$-precision. Unfortunately, even this 
remains intractable and in general POMDPs cannot be approximated efficiently (Burago, 
Rougemont, & Slissenko, 1996; Lusena, Goldsmith, & Mundhenk, 1998; Madani et al., 
1999). This is also the reason why only very simple problems can be solved optimally or 
near-optimally in practice. 

To alleviate the complexity problem, research in the POMDP area has focused on various 
heuristic methods (or approximations without the error parameter) that are more efficient. 
Heuristic methods are also our focus here. Thus, when referring to approximations, we mean 
heuristics, unless specifically stated otherwise. 

11. The quality of a heuristic approximation can be tested using the Bellman error, which requires one exact 
update step. However, heuristic methods per se do not contain a precision parameter. 
The many approximation methods and their combinations can be divided into two often 
very closely related classes: value-function approximations and policy approximations. 

3.1 Value-Function Approximations 

The main idea of the value-function approximation approach is to approximate the optimal 
value function $V^* : I \to \mathbb{R}$ with a function $\hat{V} : I \to \mathbb{R}$ defined over the same information 
space. Typically, the new function is of lower complexity (recall that the optimal or 
near-optimal value function may consist of a large set of linear functions) and is easier to compute 
than the exact solution. Approximations can often be formulated as dynamic programming 
problems and can be expressed in terms of approximate value-function updates $\hat{H}$. Thus, 
to understand the differences and advantages of various approximations and exact methods, 
it is often sufficient to analyze and compare their update rules. 

3.1.1 Value-Function Bounds 

Although heuristic approximations have no guaranteed precision, in many cases we are 
able to say whether they overestimate or underestimate the optimal value function. The 
information on bounds can be used in multiple ways. For example, upper and lower 
bounds can help narrow the range of the optimal value function, eliminate some 
of the suboptimal actions, and subsequently speed up exact methods. Alternatively, one 
can use knowledge of both value-function bounds to determine the accuracy of a controller 
generated based on one of the bounds (see Section 3.1.3). Also, in some instances, a lower 
bound alone is sufficient to guarantee the control choice that always achieves an expected 
reward at least as high as the one given by that bound (Section 4.7.2). 

The bound property of different methods can be determined by examining the updates 
and their bound relations. 

Definition 5 (Upper bound). Let $H$ be the exact value-function mapping and $\hat{H}$ its 
approximation. We say that $\hat{H}$ upper-bounds $H$ for some $V$ when $(\hat{H}V)(b) \ge (HV)(b)$ holds 
for every $b \in I$. 

An analogous definition can be constructed for the lower bound. 

3.1.2 Convergence of Approximate Value Iteration 

Let $\hat{H}$ be a value-function mapping representing an approximate update. Then the 
approximate value iteration computes the $i$th value function as $V_i = \hat{H}V_{i-1}$. The fixed-point 
solution $\hat{V}^* = \hat{H}\hat{V}^*$ or its close approximation would then represent the intended output of 
the approximation routine. The main problem with the iteration method is that in general 
it can converge to unique or multiple solutions, diverge, or oscillate, depending on $\hat{H}$ and 
the initial function $V_0$. Therefore, unique convergence cannot be guaranteed for an arbitrary 
mapping $\hat{H}$ and the convergence of a specific approximation method must be proved. 
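
The iteration and its possible failure modes can be illustrated with a generic fixed-point loop. This is a sketch under the assumption that the value function is represented by a finite vector of parameters; the function name, the tolerance and the iteration cap are hypothetical choices.

```python
import numpy as np


def iterate_update(H, V0, tol=1e-8, max_iter=10_000):
    """Repeat V <- H(V) until the max-norm change drops below tol.

    Returns (V, iterations, converged). A False flag means the iteration
    diverged, oscillated, or simply needed more steps.
    """
    V = np.asarray(V0, dtype=float)
    for i in range(1, max_iter + 1):
        V_next = H(V)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, i, True
        V = V_next
    return V, max_iter, False
```

For example, the contraction `lambda V: 1.0 + 0.5 * V` converges to 2 from any starting point, whereas `lambda V: -V` started from a nonzero vector oscillates and never passes the stopping test.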

Definition 6 (Convergence of $\hat{H}$). The value iteration with $\hat{H}$ converges for a value 
function $V_0$ when $\lim_{n \to \infty} (\hat{H}^n V_0)$ exists. 

Definition 7 (Unique convergence of $\hat{H}$). The value iteration converges uniquely for $\mathcal{V}$ 
when for every $V \in \mathcal{V}$, $\lim_{n \to \infty} (\hat{H}^n V)$ exists and for all pairs $V, U \in \mathcal{V}$, 
$\lim_{n \to \infty} (\hat{H}^n V) = \lim_{n \to \infty} (\hat{H}^n U)$. 

A sufficient condition for the unique convergence is that $\hat{H}$ be a contraction. The 
contraction and the bound properties of H can be combined, under additional conditions, to 
show the convergence of the iterative approximation method to the bound. To address this 
issue we present a theorem comparing fixed-point solutions of two value-function mappings. 

Theorem 6 Let $H_1$ and $H_2$ be two value-function mappings defined on $\mathcal{V}_1$ and $\mathcal{V}_2$ such that 

1. $H_1$, $H_2$ are contractions with fixed points $V_1^*$, $V_2^*$; 

2. $V_1^* \in \mathcal{V}_2$ and $H_2 V_1^* \ge H_1 V_1^* = V_1^*$; 

3. $H_2$ is an isotone mapping. 

Then $V_2^* \ge V_1^*$ holds. 

Note that this theorem does not require that $\mathcal{V}_1$ and $\mathcal{V}_2$ cover the same space of value 
functions. For example, $\mathcal{V}_2$ can cover all possible value functions of a belief-state MDP, 
while $\mathcal{V}_1$ can be restricted to a space of piecewise linear and convex value functions. This 
gives us some flexibility in the design of iterative approximation algorithms for computing 
value-function bounds. An analogous theorem also holds for the lower bound. 

3.1.3 Control 

Once the approximation of the value-function is available, it can be used to generate a 
control strategy. In general, control solutions correspond to options presented in Section 
2.5 and include lookahead, direct (Q-function) and finite-state machine designs. 

A drawback of control strategies based on heuristic approximations is that they have 
no precision guarantee. One way to find the accuracy of such strategies is to do one exact 
update of the value function approximation and adopt the result of Theorems 1 and 3 for 
the Bellman error. An alternative solution to this problem is to bound the accuracy of 
such controllers using the upper- and the lower-bound approximations of the optimal value 
function. To illustrate this approach, we present and prove (in the Appendix) the following 
theorem that relates the quality of bounds to the quality of a lookahead controller. 

Theorem 7 Let $V_U$ and $V_L$ be upper and lower bounds of the optimal value function for 
the discounted infinite-horizon problem. Let $\epsilon = \sup_b |V_U(b) - V_L(b)| = \|V_U - V_L\|$ be 
the maximum bound difference. Then the expected reward for a lookahead controller $V^{LA}$, 
constructed for either $V_U$ or $V_L$, satisfies $\|V^{LA} - V^*\| \le \frac{\epsilon(2 - \gamma)}{(1 - \gamma)}$. 

3.2 Policy Approximation 

An alternative to value-function approximation is policy approximation. As shown earlier, 
a strategy (controller) for a POMDP can be represented using a finite-state machine (FSM) 
model. The policy iteration searches the space of all possible policies (FSMs) for the opti- 
mal or near-optimal solution. This space is usually enormous, which is the bottleneck of the 
method. Thus, instead of searching the complete policy space, we can restrict our attention 
only to its subspace that we believe to contain the optimal solution or a good 
approximation. Memoryless policies (Platzman, 1977; White & Scherer, 1994; Littman, 1994; Singh, 
Jaakkola, & Jordan, 1994), policies based on truncated histories (Platzman, 1977; White & 
Scherer, 1994; McCallum, 1995), or finite-state controllers with a fixed number of memory 
states (Platzman, 1980; Hauskrecht, 1997; Hansen, 1998a, 1998b) are all examples of a 
policy-space restriction. In the following we consider only the finite-state machine model 
(see Section 2.6.1), which is quite general; other models can be viewed as its special cases. 

States of an FSM policy model represent the memory of the controller and, in general, 
summarize information about past activities and observations. Thus, they are best viewed 
as approximations of the information states, or as feature states. The transition model of 
the controller ($\phi$) then approximates the update function of the information-state MDP 
($\tau$) and the output function of an FSM ($\eta$) approximates the control function ($\mu$) mapping 
information states to actions. The important property of the model, as shown in Section 
2.6.2, is that the value function for a fixed controller and fixed initial memory state can be 
obtained efficiently by solving a system of linear equations (Platzman, 1980). 

To apply the policy approximation approach we first need to decide (1) how to restrict 
a space of policies and (2) how to judge the policy quality. 

A restriction frequently used is to consider only controllers with a fixed number of 
states, say k. Other structural restrictions further narrowing the space of policies can 
restrict either the output function (choice of actions at different controller states), or the 
transitions between the current and next states. In general, any heuristic or domain-related 
insight may help in selecting the right biases. 
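
To give a sense of scale for the fixed-size restriction, the number of deterministic controllers with $k$ memory states can be counted directly: each memory state picks one action and, for every observation, one successor state. The helper below is an illustrative sketch (the function name is hypothetical), and it counts controllers without removing equivalent ones.

```python
def num_deterministic_fsms(n_actions: int, n_observations: int, k: int) -> int:
    """Count deterministic FSM controllers with k memory states.

    There are n_actions**k output functions and k**(k * n_observations)
    transition functions; the count is their product.
    """
    return n_actions ** k * k ** (k * n_observations)


# Six actions and eight observations, as in the Maze20 problem used later.
two_state_controllers = num_deterministic_fsms(6, 8, 2)
```

Even with only two memory states the space already holds more than two million controllers, which illustrates why additional structural restrictions or heuristics are usually needed.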

Two different policies can yield value functions that are better in different regions of 
the belief space. Thus, in order to decide which policy is the best, we need to define the 
importance of different regions and their combinations. There are multiple solutions to this. 
For example, Platzman (1980) considers the worst-case measure and optimizes the worst 
(minimal) value for all initial belief states. Let $\mathcal{C}$ be a space of FSM controllers satisfying 
given restrictions. Then the quality of a policy under the worst case measure is: 

$$\max_{C \in \mathcal{C}} \min_{b \in I} \max_{x \in M_C} V^C(x, b).$$

Another option is to consider a distribution over all initial belief states and maximize the 
expectation of their value function values. However, the most common objective is to choose 
the policy that leads to the best value for a single initial belief state $b_0$: 

$$\max_{C \in \mathcal{C}} \max_{x \in M_C} V^C(x, b_0).$$

Finding the optimal policy for this case reduces to a combinatorial optimization problem. 
Unfortunately, for all but trivial cases, even this problem is computationally intractable. 
For example, the problem of finding the optimal policy for a memoryless case (only cur- 
rent observations are considered) is NP-hard (Littman, 1994). Thus, various heuristics are 
typically applied to alleviate this difficulty (Littman, 1994). 
Figure 10: Value-function approximation methods. 



3.2.1 Randomized Policies 

By restricting the space of policies we simplify the policy optimization problem. On the 
other hand, we simultaneously give up an opportunity to find the optimal policy, 
replacing it with the best restricted policy. Up to this point, we have considered only deterministic 
policies with a fixed number of internal controller states, that is, policies with deterministic 
output and transition functions. However, finding the best deterministic policy is not 
always the best option: randomized policies, with randomized output and transition functions, 
usually lead to far better performance. The application of randomized (or stochastic) 
policies to POMDPs was introduced by Platzman (1980). Essentially, any deterministic 
policy can be represented as a randomized policy with a single action and transition, so 
that the best randomized policy is no worse than the best deterministic policy. The differ- 
ence in control performance of two policies shows up most often in cases when the number 
of states of the controller is relatively small compared to that in the optimal strategy. 

The advantage of stochastic policies is that their space is larger and parameters of 
the policy are continuous. Therefore the problem of finding the optimal stochastic policy 
becomes a non-linear optimization problem and a variety of optimization methods can be 
applied to solve it. An example is the gradient-based approach (see Meuleau et al., 1999). 

4. Value-Function Approximation Methods 

In this section we discuss in more depth value-function approximation methods. We fo- 
cus on approximations with belief information space. We survey known techniques, but 
also include a number of new methods and modifications of existing methods. Figure 10 
summarizes the methods covered. We describe the methods by means of update rules they 

12. Alternative value-function approximations may work with complete histories of past actions and obser- 
vations. Approximation methods used by White and Scherer (1994) are an example. 
Figure 11: Test example. The maze navigation problem: Maze20. 



implement, which simplifies their analysis and theoretical comparison. We focus on the 
following properties: the complexity of the dynamic-programming (value-iteration) updates; 
the complexity of value functions each method uses; the ability of methods to bound the 
exact update; the convergence of value iteration with approximate update rules; and the 
control performance of related controllers. The results of the theoretical analysis are illus- 
trated empirically on a problem from the agent-navigation domain. In addition, we use the 
agent navigation problem to illustrate and give some intuitions on other characteristics of 
methods with no theoretical underpinning. Thus, these results should not be generalized 
to other problems or used to rank different methods. 

Agent-Navigation Problem 

Maze20 is a maze-navigation problem with 20 states, six actions and eight observations. 
The maze (Figure 11) consists of 20 partially connected rooms (states) in which a robot 
operates and collects rewards. The robot can move in four directions (north, south, east 
and west) and can check for the presence of walls using its sensors. However, neither "move" 
actions nor sensor inputs are perfect, so the robot can end up moving in unintended 
directions. The robot moves in an unintended direction with probability of 0.3 (0.15 for 
each of the neighboring directions). A move into the wall keeps the robot in the same 
position. Investigative actions help the robot to navigate by activating sensor inputs. Two 
such investigative actions allow the robot to check inputs (presence of a wall) in the north- 
south and east-west directions. Sensor accuracy in detecting walls is 0.75 for a two-wall 
case (e.g. both north and south wall), 0.8 for a one-wall case (north or south) and 0.89 for 
a no-wall case, with smaller probabilities for wrong perceptions. 

The control objective is to maximize the expected discounted rewards with a discount 
factor of 0.9. A small reward is given for every action not leading to bumping into the wall 
(4 points for a move and 2 points for an investigative action), and one large reward (150 
points) is given for achieving the special target room (indicated by the circle in the figure) 
and recognizing it by performing one of the move actions. After doing that and collecting 
the reward, the robot is placed at random in a new start position. 

Although the Maze20 problem is of only moderate complexity with regard to the size 
of state, action and observation spaces, its exact solution is beyond the reach of current 
exact methods. The exact methods tried on the problem include the Witness algorithm 
(Kaelbling et al., 1999), the incremental pruning algorithm (Cassandra et al., 1997)^13 and 

13. Many thanks to Anthony Cassandra for running these algorithms. 
Figure 12: Approximations based on the fully observable version of a two state POMDP 
(with states $s_1$, $s_2$): (a) the MDP approximation; (b) the QMDP approximation. 
Values at extreme points of the belief space are solutions of the fully observable 
MDP. 



policy iteration with an FSM model (Hansen, 1998b). The main obstacle preventing these 
algorithms from obtaining the optimal or close-to-optimal solution was the complexity of 
the value function (the number of linear functions needed to describe it) and subsequent 
running times and memory problems. 

4.1 Approximations with Fully Observable MDP 

Perhaps the simplest way to approximate the value function for a POMDP is to assume 
that states of the process are fully observable (Astrom, 1965; Lovejoy, 1993). In that case 
the optimal value function V* for a POMDP can be approximated as: 

$$\hat{V}(b) = \sum_{s \in S} b(s)\, V^*_{MDP}(s), \qquad (8)$$

where $V^*_{MDP}(s)$ is the optimal value function for state $s$ for the fully observable version of 
the process. We refer to this approximation as the MDP approximation. The idea of 
the approximation is illustrated in Figure 12a. The resulting value function is linear and 
is fully defined by values at extreme points of the belief simplex. These correspond to the 
optimal values for the fully observable case. The main advantage of the approximation 
is that the fully observable MDP (FOMDP) can be solved efficiently for both the finite- 
horizon problem and discounted infinite-horizon problems. The update step for the (fully 
observable) MDP is: 

$$V^{MDP}_{i+1}(s) = \max_{a \in A} \left\{ \rho(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{MDP}_{i}(s') \right\}.$$
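
The update step above can be iterated to a fixed point with a few lines of NumPy. This is a minimal sketch, assuming rewards stored as `R[s, a]` and transitions as `T[a, s, s2]`; the function name, the array layout and the stopping tolerance are illustrative assumptions.

```python
import numpy as np


def mdp_value_iteration(R, T, gamma, tol=1e-10):
    """Value iteration for the fully observable MDP.

    R[s, a]      immediate reward rho(s, a)
    T[a, s, s2]  P(s2 | s, a)
    Returns (V, Q): approximately optimal state values and action values.
    """
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = rho(s, a) + gamma * sum_s2 P(s2 | s, a) V(s2)
        Q = R + gamma * np.einsum("ast,t->sa", T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, R + gamma * np.einsum("ast,t->sa", T, V_new)
        V = V_new
```

The returned action values are reused by the Q-function variant of the approximation, so computing them here avoids a second pass over the model.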

14. The solution for the finite-state fully observable MDP and discounted infinite-horizon criterion can be 
found efficiently by formulating an equivalent linear programming task (Bertsekas, 1995) 
4.1.1 MDP Approximation 

The MDP-approximation approach (Equation 8) can be also described in terms of value- 
function updates for the belief-space MDP. Although this step is strictly speaking redundant 
here, it simplifies the analysis and comparison of this approach to other approximations. 

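Let $V_i$ be a linear value function described by a vector $\alpha_i^{MDP}$ corresponding to values 
of $V_i^{MDP}(s')$ for all states $s' \in S$. Then the $(i+1)$th value function $V_{i+1}$ is 

$$V_{i+1}(b) = \sum_{s \in S} b(s) \max_{a \in A} \left\{ \rho(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, \alpha_i^{MDP}(s') \right\} = (H_{MDP} V_i)(b).$$

$V_{i+1}$ is described by a linear function with components 

$$\alpha_{i+1}^{MDP}(s) = V_{i+1}^{MDP}(s) = \max_{a \in A} \left\{ \rho(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, \alpha_i^{MDP}(s') \right\}.$$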


The MDP-based rule $H_{MDP}$ can also be rewritten in a more general form that starts from 
an arbitrary piecewise linear and convex value function $V_i$, represented by a set of linear 
functions $\Gamma_i$: 

$$V_{i+1}(b) = \sum_{s \in S} b(s) \max_{a \in A} \left\{ \rho(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{\alpha_i \in \Gamma_i} \alpha_i(s') \right\}.$$

The application of the $H_{MDP}$ mapping always leads to a linear value function. The 
update is easy to compute and takes $O(|A||S|^2 + |\Gamma_i||S|)$ time. This reduces to $O(|A||S|^2)$ 
time when only MDP-based updates are strung together. As remarked earlier, the optimal 
solution for the infinite-horizon, discounted problem can be found efficiently via linear 
programming. 

The update for the MDP approximation upper-bounds the exact update, that is, $HV_i \le 
H_{MDP}V_i$. We show this property later in Theorem 9, which covers more cases. The intuition 
is that we cannot get a better solution with less information, and thus the fully observable 
MDP must upper-bound the partially observable case. 

4.1.2 Approximation with Q-Functions (QMDP) 

A variant of the approximation based on the fully observable MDP uses Q-functions (Littman, 
Cassandra, & Kaelbling, 1995): 

$$\hat{V}(b) = \max_{a \in A} \sum_{s \in S} b(s)\, Q^*_{MDP}(s, a),$$

where 

$$Q^*_{MDP}(s, a) = \rho(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*_{MDP}(s')$$

is the optimal action-value function (Q-function) for the fully observable MDP. The QMDP 
approximation $\hat{V}$ is piecewise linear and convex with $|A|$ linear functions, each corresponding 
to one action (Figure 12b). The QMDP update rule (for the belief state MDP) for $V_i$ with 
linear functions $\alpha_i \in \Gamma_i$ is: 

$$V_{i+1}(b) = \max_{a \in A} \sum_{s \in S} b(s) \left[ \rho(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{\alpha_i \in \Gamma_i} \alpha_i(s') \right] = (H_{QMDP} V_i)(b).$$

$H_{QMDP}$ generates a value function with $|A|$ linear functions. The time complexity of the 
update is the same as for the MDP-approximation case, $O(|A||S|^2 + |\Gamma_i||S|)$, which reduces 
to $O(|A||S|^2)$ time when only QMDP updates are used. $H_{QMDP}$ is a contraction mapping 
and its fixed-point solution can be found by solving the corresponding fully observable MDP. 

The QMDP update upper-bounds the exact update. The bound is tighter than the 
MDP update; that is, $HV_i \le H_{QMDP}V_i \le H_{MDP}V_i$, as we prove later in Theorem 9. The 
same inequalities hold for both fixed-point solutions (through Theorem 6). 
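
The relation between the two bounds at a given belief state is easy to check numerically: the MDP approximation takes the maximum over actions inside the expectation over states, the QMDP approximation outside it. The sketch below assumes the optimal action values of the fully observable MDP are available as an array `Q[s, a]`; the function names and the numbers in `Q_example` are illustrative, not taken from the experiments.

```python
import numpy as np


def mdp_bound(b, Q):
    """MDP approximation: sum_s b(s) max_a Q(s, a)."""
    return float(b @ Q.max(axis=1))


def qmdp_bound(b, Q):
    """QMDP approximation: max_a sum_s b(s) Q(s, a)."""
    return float((b @ Q).max())


# Hypothetical action values for a two-state, two-action problem.
Q_example = np.array([[8.1, 9.0],
                      [10.0, 9.1]])
b_example = np.array([0.5, 0.5])
```

For `b_example` the QMDP value is 9.05 and the MDP value is 9.5; at the corners of the belief simplex the two coincide, as in Figure 12.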

To illustrate the difference in the quality of bounds for the MDP approximation and 
the QMDP method, we use our Maze20 navigation problem. To measure the quality of a 
bound we use the mean of value-function values. Since all belief states are equally important 
we assume that they are uniformly distributed. We approximate this measure using the 
average of values for a fixed set of $N = 2000$ belief points. The points in the set were 
selected uniformly at random at the beginning. Once the set was chosen, it was fixed 
and remained the same for all tests (here and later). Figure 13 shows the results of the 
experiment; we include also results for the fast informed bound method that is presented in 
the next section. Figure 13 also shows the running times of the methods. The methods 
were implemented in Common Lisp and run on a Sun Ultra 1 workstation. 

4.1.3 Control 

The MDP and the QMDP value-function approximations can be used to construct con- 
trollers based on one-step lookahead. In addition, the QMDP approximation is also suitable 
for the direct control strategy, which selects an action corresponding to the best (highest 
value) Q-function. Thus, the method is a special case of the Q-function approach discussed 
in Section 3.1.3.^16 The advantage of the direct QMDP method is that it is faster than both 
lookahead designs. On the other hand, lookahead tends to improve the control performance. 
This is shown in Figure 14, which compares the control performance of different controllers 
on the Maze20 problem. 
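
The difference between the two controller designs can be sketched as follows. This is an illustrative sketch, not the controllers used in the experiments: it assumes the QMDP action values are given as `Q[s, a]`, the model as `R[s, a]`, `T[a, s, s2]` and `Z[a, s2, o]`, and all function names are hypothetical.

```python
import numpy as np


def direct_qmdp_action(b, Q):
    """Direct control: pick the action with the best Q-function at b."""
    return int(np.argmax(b @ Q))


def lookahead_action(b, Q, R, T, Z, gamma):
    """One-step lookahead on the QMDP value function V(b) = max_a b . Q[:, a]."""
    best_a, best_v = None, -np.inf
    for a in range(R.shape[1]):
        v = float(b @ R[:, a])
        pred = b @ T[a]                      # predicted next-state distribution
        for o in range(Z.shape[2]):
            unnorm = Z[a, :, o] * pred       # joint P(o, s2 | b, a)
            # V is piecewise linear, so P(o | b, a) * V(b') equals V applied
            # to the unnormalized belief.
            v += gamma * float((unnorm @ Q).max())
        if v > best_v:
            best_a, best_v = a, v
    return best_a
```

The direct controller needs a single matrix-vector product per decision, while the lookahead controller loops over all actions and observations, which is consistent with the difference in reaction times. On a small problem with two symmetric goal actions and one sensing action, the direct controller at the uniform belief picks a goal action, whereas the lookahead controller picks the sensing action.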

The quality of a policy $\pi$, with no preference towards a particular initial belief state, can 
be measured by the mean of value-function values for $\pi$ and uniformly distributed initial 
belief states. We approximate this measure using the average of discounted rewards for 

15. The confidence interval limits for probability level 0.95 range in ±(0.45, 0.62) from their respective 
average scores and this holds for all bound experiments in the paper. As these are relatively small we 
do not include them in our graphs. 

16. As pointed out by Littman et al. (1995), in some instances, the direct QMDP controller never selects 
investigative actions, that is, actions that try to gain more information about the underlying process 
state. Note, however, that this observation is not true in general and the QMDP-based controller with 
direct action selection may select investigative actions, even though in the fully observable version of the 
problem investigative actions are never chosen. 
Figure 13: Comparison of the MDP, QMDP and fast informed bound approximations:
bound quality (left); running times (right). The bound-quality score is the 
average value of the approximation for the set of 2000 belief points (chosen uni- 
formly at random). As the methods upper-bound the optimal value function, we 
flip the bound-quality graph so that longer bars indicate better approximations. 



2000 control trajectories obtained for the fixed set of $N = 2000$ initial belief states (selected
uniformly at random at the beginning). The trajectories were obtained through simulation
and were 60 steps long.^17

To validate the comparison along the averaged performance scores, we must show that 
these scores are not the result of randomness and that methods are indeed statistically 
significantly different. To do this we rely on pairwise significance tests.^18 To summarize the
obtained results, the score differences of 1.54, 2.09 and 2.86 between any two methods (here
and also later in the paper) are sufficient to reject the method with a lower score being
the better performer at significance levels 0.05, 0.01 and 0.001 respectively.^19 Error-bars in
Figure 14 reflect the critical score difference for the significance level 0.05. 

Figure 14 also shows the average reaction times for different controllers during these 
experiments. The results show the clear dominance of the direct QMDP controller, which 
need not do a lookahead in order to extract an action, compared to the other two MDP- 
based controllers. 

4.2 Fast Informed Bound Method 

Both the MDP and the QMDP approaches ignore partial observability and use the fully 
observable MDP as a surrogate. To improve these approximations and account (at least to 

17. The length of the trajectories (60 steps) for the Maze20 problem was chosen to ensure that our estimates 
of (discounted) cumulative rewards are not far from the actual rewards for an infinite number of steps. 

18. An alternative way to compare two methods is to compute confidence limits for their scores and inspect 
their overlaps. However, in this case, the ability to distinguish two methods can be reduced due to 
fluctuations of scores for different initializations. For Maze20, confidence interval limits for probability
level 0.95 range in ±(1.8, 2.3) from their respective average scores. This covers all control experiments
here and later. Pairwise tests eliminate the dependency by examining the differences of individual values
and thus improve the discriminative power. 

19. The critical score differences listed cover the worst case combination. Thus, there may be some pairs for 
which the smaller difference would suffice. 
Figure 14: Comparison of control performance of the MDP, QMDP and fast informed bound
methods: quality of control (left); reaction times (right). The quality-of-control 
score is the average of discounted rewards for 2000 control trajectories obtained 
for the fixed set of 2000 initial belief states (selected uniformly at random). 
Error-bars show the critical score difference value (1.54) at which any two meth- 
ods become statistically different at significance level 0.05. 



some degree) for partial observability we propose a new method - the fast informed bound
method. Let $V_i$ be a piecewise linear and convex value function represented by a set of linear
functions $\Gamma_i$. The new update is defined as

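\[
\begin{aligned}
V_{i+1}(b) &= \max_{a \in A} \left\{ \sum_{s \in S} \rho(s,a) b(s) + \gamma \sum_{o \in \Theta} \sum_{s \in S} \max_{\alpha_i \in \Gamma_i} \sum_{s' \in S} P(s',o|s,a) b(s) \alpha_i(s') \right\} \\
&= \max_{a \in A} \left\{ \sum_{s \in S} b(s) \left[ \rho(s,a) + \gamma \sum_{o \in \Theta} \max_{\alpha_i \in \Gamma_i} \sum_{s' \in S} P(s',o|s,a) \alpha_i(s') \right] \right\} \\
&= (H_{FIB} V_i)(b).
\end{aligned}
\]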

The fast informed bound update can be obtained from the exact update by the following 
derivation: 



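\[
\begin{aligned}
(H V_i)(b) &= \max_{a \in A} \left\{ \sum_{s \in S} \rho(s,a) b(s) + \gamma \sum_{o \in \Theta} \max_{\alpha_i \in \Gamma_i} \sum_{s' \in S} \sum_{s \in S} P(s',o|s,a) b(s) \alpha_i(s') \right\} \\
&\le \max_{a \in A} \left\{ \sum_{s \in S} \rho(s,a) b(s) + \gamma \sum_{o \in \Theta} \sum_{s \in S} \max_{\alpha_i \in \Gamma_i} \sum_{s' \in S} P(s',o|s,a) b(s) \alpha_i(s') \right\} \\
&= \max_{a \in A} \left\{ \sum_{s \in S} b(s) \left[ \rho(s,a) + \gamma \sum_{o \in \Theta} \max_{\alpha_i \in \Gamma_i} \sum_{s' \in S} P(s',o|s,a) \alpha_i(s') \right] \right\} \\
&= (H_{FIB} V_i)(b).
\end{aligned}
\]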

The value function $V_{i+1} = H_{FIB} V_i$ one obtains after an update is piecewise linear and
convex and consists of at most $|A|$ different linear functions, each corresponding to one
action

\[ \alpha_{i+1}^a(s) = \rho(s,a) + \gamma \sum_{o \in \Theta} \max_{\alpha_i \in \Gamma_i} \sum_{s' \in S} P(s',o|s,a) \alpha_i(s'). \]

The $H_{FIB}$ update is efficient and can be computed in $O(|A||S|^2|\Theta||\Gamma_i|)$ time. As the method
always outputs $|A|$ linear functions, the computation can be done in $O(|A|^2|S|^2|\Theta|)$ time,
when many $H_{FIB}$ updates are strung together. This is a significant complexity reduction
compared to the exact approach: the latter can lead to a function consisting of $|A||\Gamma_i|^{|\Theta|}$
linear functions, which is exponential in the number of observations and in the worst case
takes $O(|A||S|^2|\Gamma_i|^{|\Theta|})$ time.
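The update above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Common Lisp implementation used in the experiments: the array layout (`R[s][a]`, `T[a][s][s2]`, `O[a][s2][o]`), the factorization P(s',o|s,a) = P(s'|s,a) P(o|s',a), and the toy numbers are all assumptions made for the example.

```python
def fib_update(alphas, R, T, O, gamma):
    """One fast informed bound backup; returns one linear function per action.

    Assumed (hypothetical) layout: R[s][a] = rho(s, a), T[a][s][s2] = P(s2 | s, a),
    O[a][s2][o] = P(o | s2, a), so P(s2, o | s, a) = T[a][s][s2] * O[a][s2][o].
    """
    n_states, n_actions, n_obs = len(R), len(T), len(O[0][0])
    new_alphas = []
    for a in range(n_actions):
        vec = []
        for s in range(n_states):
            future = 0.0
            for o in range(n_obs):
                # The max is taken separately for every (state, observation) pair.
                future += max(
                    sum(T[a][s][s2] * O[a][s2][o] * alpha[s2] for s2 in range(n_states))
                    for alpha in alphas
                )
            vec.append(R[s][a] + gamma * future)
        new_alphas.append(vec)
    return new_alphas


def value(alphas, b):
    """Value of a piecewise linear convex function at belief b."""
    return max(sum(bs * x for bs, x in zip(b, alpha)) for alpha in alphas)


# Toy model with 2 states, 2 actions and 2 observations (made-up numbers).
R = [[1.0, 0.0], [0.0, 1.0]]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.5, 0.5]]]

alphas = [[0.0, 0.0]]
for _ in range(50):
    alphas = fib_update(alphas, R, T, O, gamma=0.9)
print(len(alphas), value(alphas, [0.5, 0.5]))
```

Because the output always has one vector per action, the size of the representation stays at |A| no matter how many updates are chained.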

As HpiB updates are of polynomial complexity one can find the approximation for the 
finite-horizon case efficiently. The open issue remains the problem of finding the solution 
for the infinite-horizon discounted case and its complexity. To address it we establish the 
following theorem. 

Theorem 8 A solution for the fast informed bound approximation can be found by solving
an MDP with $|S||A||\Theta|$ states, $|A|$ actions and the same discount factor $\gamma$.

The full proof of the theorem is deferred to the Appendix. The key part of the proof 
is the construction of an equivalent MDP with $|S||A||\Theta|$ states representing $H_{FIB}$ updates.
Since a finite-state MDP can be solved through linear program conversion, the fixed-point 
solution for the fast informed bound update is computable efficiently. 

4.2.1 Fast Informed Bound versus Fully-Observable MDP Approximations 

The fast informed update upper-bounds the exact update and is tighter than both the MDP 
and the QMDP approximation updates. 

Theorem 9 Let $V_i$ correspond to a piecewise linear convex value function defined by $\Gamma_i$
linear functions. Then $H V_i \le H_{FIB} V_i \le H_{QMDP} V_i \le H_{MDP} V_i$.

The key trick in deriving the above result is to swap max and sum operators (the
proof is in the Appendix) and thus obtain both the upper-bound inequalities and the
subsequent reduction in the complexity of update rules compared to the exact update.
This is also shown in Figure 15. The UMDP approximation, also included in Figure 15, 
is discussed later in Section 4.3. Thus, the difference among the methods boils down to 
simple mathematical manipulations. Note that the same inequality relations as derived for 
updates hold also for their fixed-point solutions (through Theorem 6). 

Figure 13a illustrates the improvement of the bound over MDP-based approximations 
on the Maze20 problem. Note, however, that this improvement is paid for by the increased 
running-time complexity (Figure 13b). 
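The chain of inequalities in Theorem 9, together with the UMDP lower bound of Section 4.3, can be checked numerically at individual belief points. The sketch below is a hedged illustration only: the model layout, the helper name `backups`, the toy numbers and the set of linear functions are assumptions introduced for the example, and it assumes P(s',o|s,a) factors as P(s'|s,a) P(o|s',a).

```python
def backups(b, alphas, R, T, O, gamma):
    """Return (UMDP, exact, FIB, QMDP, MDP) one-step backup values at belief b.

    Assumed layout: R[s][a], T[a][s][s2] = P(s2 | s, a), O[a][s2][o] = P(o | s2, a).
    """
    S, A, Z = range(len(R)), range(len(T)), range(len(O[0][0]))
    best = [max(al[s2] for al in alphas) for s2 in S]   # max over alpha of alpha(s2)
    umdp, exact, fib, qmdp = [], [], [], []
    mdp_q = [[0.0 for _ in A] for _ in S]
    for a in A:
        r = sum(R[s][a] * b[s] for s in S)
        # UMDP: a single alpha is shared by all observations and states.
        umdp.append(r + gamma * max(
            sum(T[a][s][s2] * b[s] * al[s2] for s in S for s2 in S) for al in alphas))
        # Exact: one alpha per observation.
        exact.append(r + gamma * sum(
            max(sum(T[a][s][s2] * O[a][s2][o] * b[s] * al[s2] for s in S for s2 in S)
                for al in alphas)
            for o in Z))
        # Fast informed bound: one alpha per (state, observation) pair.
        fib.append(r + gamma * sum(
            b[s] * max(sum(T[a][s][s2] * O[a][s2][o] * al[s2] for s2 in S) for al in alphas)
            for s in S for o in Z))
        # QMDP and MDP: best alpha chosen separately for every next state.
        for s in S:
            mdp_q[s][a] = R[s][a] + gamma * sum(T[a][s][s2] * best[s2] for s2 in S)
        qmdp.append(sum(b[s] * mdp_q[s][a] for s in S))
    mdp = sum(b[s] * max(mdp_q[s]) for s in S)
    return max(umdp), max(exact), max(fib), max(qmdp), mdp


# Toy model (made-up numbers) and an arbitrary set of linear functions.
R = [[1.0, 0.0], [0.0, 1.0]]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.5, 0.5]]]
gamma_i = [[4.0, 0.0], [0.0, 4.0], [2.5, 2.5]]

print(backups([0.3, 0.7], gamma_i, R, T, O, 0.9))
```

Each line of the function differs from its neighbor only in where the max over linear functions sits relative to the sums, which mirrors the swap argument used in the proof.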

4.2.2 Control 

The fast informed bound always outputs a piecewise linear and convex function, with one 
linear function per action. This allows us to build a POMDP controller that selects an action 
associated with the best (highest value) linear function directly. Figure 14 compares the 
control performance of the direct and the lookahead controllers to the MDP and the QMDP 
controllers. We see that the fast informed bound leads not only to tighter bounds but also 






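UMDP update:

\[ V_{i+1}(b) = \max_{a \in A} \left\{ \sum_{s \in S} b(s) \rho(s,a) + \gamma \max_{\alpha_i \in \Gamma_i} \sum_{s \in S} \sum_{s' \in S} P(s'|s,a) b(s) \alpha_i(s') \right\} \]

$\le$

exact update:

\[ V_{i+1}(b) = \max_{a \in A} \left\{ \sum_{s \in S} b(s) \rho(s,a) + \gamma \sum_{o \in \Theta} \max_{\alpha_i \in \Gamma_i} \sum_{s \in S} \sum_{s' \in S} P(s',o|s,a) b(s) \alpha_i(s') \right\} \]

$\le$

fast informed bound update:

\[ V_{i+1}(b) = \max_{a \in A} \left\{ \sum_{s \in S} b(s) \left[ \rho(s,a) + \gamma \sum_{o \in \Theta} \max_{\alpha_i \in \Gamma_i} \sum_{s' \in S} P(s',o|s,a) \alpha_i(s') \right] \right\} \]

$\le$

QMDP approx. update:

\[ V_{i+1}(b) = \max_{a \in A} \left\{ \sum_{s \in S} b(s) \left[ \rho(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) \max_{\alpha_i \in \Gamma_i} \alpha_i(s') \right] \right\} \]

$\le$

MDP approx. update:

\[ V_{i+1}(b) = \sum_{s \in S} b(s) \max_{a \in A} \left[ \rho(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) \max_{\alpha_i \in \Gamma_i} \alpha_i(s') \right] \]
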

Figure 15: Relations between the exact update and the UMDP, the fast informed bound, 
the QMDP and the MDP updates. 



to improved control on average. However, we stress that currently there is no theoretical 
underpinning for this observation and thus it may not be true for all belief states and any 
problem. 

4.2.3 Extensions of the Fast Informed Bound Method 

The main idea of the fast informed bound method is to select the best linear function for 
every observation and every current state separately. This differs from the exact update 
where we seek a linear function that gives the best result for every observation and the 
combination of all states. However, we observe that there is a great deal of middle ground 
between these two extremes. Indeed, one can design an update rule that chooses optimal 
(maximal) linear functions for disjoint sets of states separately. To illustrate this idea, 
assume a partitioning $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ of the state space $S$. The new update for $\mathcal{S}$ is:



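\[
\begin{aligned}
V_{i+1}(b) = \max_{a \in A} \Bigg\{ \sum_{s \in S} \rho(s,a) b(s) + \gamma \sum_{o \in \Theta} \Bigg[
& \max_{\alpha_i \in \Gamma_i} \sum_{s \in S_1} \sum_{s' \in S} P(s',o|s,a) b(s) \alpha_i(s') \; + \\
& \max_{\alpha_i \in \Gamma_i} \sum_{s \in S_2} \sum_{s' \in S} P(s',o|s,a) b(s) \alpha_i(s') + \cdots + \\
& \max_{\alpha_i \in \Gamma_i} \sum_{s \in S_m} \sum_{s' \in S} P(s',o|s,a) b(s) \alpha_i(s') \Bigg] \Bigg\}.
\end{aligned}
\]
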

It is easy to see that the update upper-bounds the exact update. Exploration of this 
approach and various partitioning heuristics remains an interesting open research issue. 
4.3 Approximation with Unobservable MDP

The MDP-approximation assumes full observability of POMDP states to obtain simpler 
and more efficient updates. The other extreme is to discard all observations available to 
the decision maker. An MDP with no observations is called an unobservable MDP (UMDP)
and one may choose its value-function solution as an alternative approximation. 

To find the solution for the unobservable MDP, we derive the corresponding update
rule, $H_{UMDP}$, similarly to the update for the partially observable case. $H_{UMDP}$ preserves
piecewise linearity and convexity of the value function and is a contraction. The update
equals:

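\[
\begin{aligned}
V_{i+1}(b) &= \max_{a \in A} \left\{ \sum_{s \in S} \rho(s,a) b(s) + \gamma \max_{\alpha_i \in \Gamma_i} \sum_{s \in S} \sum_{s' \in S} P(s'|s,a) b(s) \alpha_i(s') \right\} \\
&= (H_{UMDP} V_i)(b),
\end{aligned}
\]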

where $\Gamma_i$ is a set of linear functions describing $V_i$. $V_{i+1}$ remains piecewise linear and convex
and it consists of at most $|\Gamma_i||A|$ linear functions. This is in contrast to the exact update,
where the number of possible vectors in the next step can grow exponentially in the number
of observations and leads to $|A||\Gamma_i|^{|\Theta|}$ possible vectors. The time complexity of the update is
$O(|A||S|^2|\Gamma_i|)$. Thus, starting from $V_0$ with one linear function, the running-time complexity
for $k$ updates is bounded by $O(|A|^k|S|^2)$. The problem of finding the optimal solution for the
unobservable MDP remains intractable: the finite-horizon case is NP-hard (Burago et al.,
1996), and the discounted infinite-horizon case is undecidable (Madani et al., 1999). Thus,
it is usually not a very useful approximation.

The update $H_{UMDP}$ lower-bounds the exact update, an intuitive result reflecting the
fact that one cannot do better with less information. To provide some insight into how 
the two updates are related, we do the following derivation, which also proves the bound 
property in an elegant way: 

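\[
\begin{aligned}
(H V_i)(b) &= \max_{a \in A} \left\{ \sum_{s \in S} \rho(s,a) b(s) + \gamma \sum_{o \in \Theta} \max_{\alpha_i \in \Gamma_i} \sum_{s' \in S} \sum_{s \in S} P(s',o|s,a) b(s) \alpha_i(s') \right\} \\
&\ge \max_{a \in A} \left\{ \sum_{s \in S} \rho(s,a) b(s) + \gamma \max_{\alpha_i \in \Gamma_i} \sum_{o \in \Theta} \sum_{s \in S} \sum_{s' \in S} P(s',o|s,a) b(s) \alpha_i(s') \right\} \\
&= \max_{a \in A} \left\{ \sum_{s \in S} \rho(s,a) b(s) + \gamma \max_{\alpha_i \in \Gamma_i} \sum_{s \in S} \sum_{s' \in S} P(s'|s,a) b(s) \alpha_i(s') \right\} \\
&= (H_{UMDP} V_i)(b).
\end{aligned}
\]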



We see that the difference between the exact and UMDP updates is that the max and 
the sum over next-step observations are exchanged. This causes the choice of $\alpha$ vectors in
$H_{UMDP}$ to become independent of the observations. Once the sum and max operations are
exchanged, the observations can be marginalized out. Recall that the idea of swaps leads 
to a number of approximation updates; see Figure 15 for their summary. 
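A minimal sketch of the UMDP update illustrates why the representation grows by a factor of |A| per step when no pruning is applied. The data layout (`R[s][a]`, `T[a][s][s2]`) and the toy numbers are assumptions made for this example only.

```python
def umdp_update(alphas, R, T, gamma):
    """One UMDP backup: every (action, old vector) pair yields one new vector.

    Assumed layout: R[s][a] = rho(s, a), T[a][s][s2] = P(s2 | s, a).
    """
    n = len(R)
    return [
        [R[s][a] + gamma * sum(T[a][s][s2] * alpha[s2] for s2 in range(n))
         for s in range(n)]
        for a in range(len(T))
        for alpha in alphas
    ]


# Toy model with 2 states and 2 actions (made-up numbers).
R = [[1.0, 0.0], [0.0, 1.0]]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]

vectors = [[0.0, 0.0]]
sizes = []
for _ in range(3):
    vectors = umdp_update(vectors, R, T, gamma=0.9)
    sizes.append(len(vectors))
print(sizes)
```

Without pruning of dominated vectors the count after k updates is |A|^k, which is the source of the exponential running-time bound quoted above.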
4.4 Fixed-Strategy Approximations

A finite-state machine (FSM) model is used primarily to define a control strategy. Such a 
strategy does not require belief state updates since it directly maps sequences of observations 
to sequences of actions. The value function of an FSM strategy is piecewise linear and convex 
and can be found efficiently in the number of memory states (Section 2.6.1). While in the 
policy iteration and policy approximation contexts the value function for a specific strategy 
is used to quantify the goodness of the policy in the first place, the value function alone can 
be also used as a substitute for the optimal value function. In this case, the value function 
(defined over the belief space) equals 

\[ V^C(b) = \max_{x \in M} V^C(x, b), \]

where $V^C(x, b) = \sum_{s \in S} V^C(x, s) b(s)$ is obtained by solving a set of $|S||M|$ linear equations
(Section 2.6.2). As remarked earlier, the value for the fixed strategy lower-bounds the
optimal value function, that is $V^C \le V^*$.

To simplify the comparison of the fixed-strategy approximation to other approximations, 
we can rewrite its solution also in terms of fixed-strategy updates 

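\[
\begin{aligned}
V_{i+1}(b) &= \max_{x \in M} \left\{ \sum_{s \in S} \rho(s, \eta(x)) b(s) + \gamma \sum_{o \in \Theta} \sum_{s \in S} \sum_{s' \in S} P(o, s'|s, \eta(x)) b(s) \alpha_i(\phi(x,o), s') \right\} \\
&= \max_{x \in M} \left\{ \sum_{s \in S} b(s) \left[ \rho(s, \eta(x)) + \gamma \sum_{o \in \Theta} \sum_{s' \in S} P(o, s'|s, \eta(x)) \alpha_i(\phi(x,o), s') \right] \right\} \\
&= (H_{FSM} V_i)(b).
\end{aligned}
\]
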



The value function $V_i$ is piecewise linear and convex and consists of $|M|$ linear functions
$\alpha_i(x, \cdot)$. For the infinite-horizon discounted case $\alpha_i(x, s)$ represents the $i$th approximation of
$V^C(x, s)$. Note that the update can be applied to the finite-horizon case in a straightforward
way. 
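The fixed-strategy update can be iterated to approximate the values $V^C(x, s)$ instead of solving the linear system directly. The sketch below is illustrative only: the encoding of the FSM (`eta[x]` for the action of memory state x, `phi[x][o]` for its successor), the model layout and the toy numbers are assumptions, and the iteration is a stand-in for the exact linear-equation solution mentioned in the text.

```python
def fsm_values(eta, phi, R, T, O, gamma, iters=500):
    """Iterate the fixed-strategy update; returns V[x][s] for every memory state x.

    Assumed layout: R[s][a], T[a][s][s2] = P(s2 | s, a), O[a][s2][o] = P(o | s2, a),
    eta[x] = action chosen in memory state x, phi[x][o] = next memory state.
    """
    n_mem, n_states, n_obs = len(eta), len(R), len(O[0][0])
    V = [[0.0] * n_states for _ in range(n_mem)]
    for _ in range(iters):
        V = [
            [R[s][eta[x]] + gamma * sum(
                T[eta[x]][s][s2] * O[eta[x]][s2][o] * V[phi[x][o]][s2]
                for o in range(n_obs) for s2 in range(n_states))
             for s in range(n_states)]
            for x in range(n_mem)
        ]
    return V


def fsm_value_at_belief(V, b):
    """Lower bound at belief b: the best memory state for b."""
    return max(sum(bs * v for bs, v in zip(b, row)) for row in V)


# Toy model (made-up numbers) and two one-action policies, one per action.
R = [[1.0, 0.0], [0.0, 1.0]]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.5, 0.5]]]
eta = [0, 1]
phi = [[0, 0], [1, 1]]

V = fsm_values(eta, phi, R, T, O, gamma=0.9)
print(fsm_value_at_belief(V, [0.5, 0.5]))
```

For the policy that always takes the second action in this toy model the values can be verified by hand: V(s) = R(s) + 0.45 (V(0) + V(1)), which gives V = (4.5, 5.5).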

4.4.1 Quality of Control 

Assume we have an FSM strategy and would like to use it as a substitute for the optimal 
control policy. There are three different ways in which we can use it to extract the control. 
The first is to simply execute the strategy represented by the FSM. There is no need 
to update belief states in this case. The second possibility is to choose linear functions 
corresponding to different memory states and their associated actions repeatedly in every 
step. We refer to such a controller as a direct (DR) controller. This approach requires 
updating of belief states in every step. On the other hand its control performance is no 
worse than that of the FSM control. The final strategy discards all the information about 
actions and extracts the policy by using the value function V{b) and one-step lookahead. 
This method (LA) requires both belief state updates and lookaheads and leads to the slowest
reaction time. Like DR, however, this strategy is guaranteed to be no worse than the FSM
controller. The following theorem relates the performances of the three controllers. 
Figure 16: Comparison of three different controllers (FSM, DR and LA) for the Maze20
problem and a collection of one-action policies: control quality (left) and re- 
sponse time (right). Error-bars in the control performance graph indicate the 
critical score difference at which any two methods become statistically different 
at significance level 0.05. 



Theorem 10 Let $C_{FSM}$ be an FSM controller. Let $C_{DR}$ and $C_{LA}$ be the direct and the
one-step-lookahead controllers constructed based on $C_{FSM}$. Then $V^{C_{FSM}}(b) \le V^{C_{DR}}(b)$ and
$V^{C_{FSM}}(b) \le V^{C_{LA}}(b)$ hold for all belief states $b$.

Though we can prove that both the direct controller and the lookahead controller are
never worse than the underlying FSM controller (see the Appendix for the full proof of the
theorem), we cannot show a similar relation between the direct and the lookahead controllers for all initial
belief states. However, the lookahead approach typically tends to dominate, reflecting the
usual trade-off between control quality and response time. We illustrate this trade-off on 
our running Maze20 example and a collection of \A\ one-action policies, each generating a 
sequence of the same action. Control quality and response time results are shown in Figure 
16. We see that the controller based on the FSM is the fastest of the three, but it is also the 
worst in terms of control quality. On the other hand, the direct controller is slower (it needs 
to update belief states in every step) but delivers better control. Finally, the lookahead 
controller is the slowest and has the best control performance. 

4.4.2 Selecting the FSM Model 

The quality of a fixed-strategy approximation depends strongly on the FSM model used. 
The model can be provided a priori or constructed automatically. Techniques for automatic 
construction of FSM policies correspond to a search problem in which either the complete or 
a restricted space of policies is examined to find the optimal or the near-optimal policy for 
such a space. The search process is equivalent to policy approximations or policy-iteration 
techniques discussed earlier in Sections 2.6 and 3.2. 
4.5 Grid-Based Approximations with Value Interpolation-Extrapolation

A value function over a continuous belief space can be approximated by a finite set of grid 
points G and an interpolation-extrapolation rule that estimates the value of an arbitrary
point of the belief space by relying only on the points of the grid and their associated values. 

Definition 8 (Interpolation-extrapolation rule) Let $f : \mathcal{I} \to \mathbb{R}$ be a real-valued function
defined over the information space $\mathcal{I}$, $G = \{b_1^G, b_2^G, \ldots, b_k^G\}$ be a set of grid points and
$\Psi^G = \{(b_1^G, f(b_1^G)), (b_2^G, f(b_2^G)), \ldots, (b_k^G, f(b_k^G))\}$ be a set of point-value pairs. A function
$R_G : \mathcal{I} \times (\mathcal{I} \times \mathbb{R})^{|G|} \to \mathbb{R}$ that estimates $f$ at any point of the information space $\mathcal{I}$ using only
values associated with grid points is called an interpolation-extrapolation rule.

The main advantage of an interpolation-extrapolation model in estimating the true value
function is that it requires us to compute value updates only for a finite set of grid points
$G$. Let $V_i$ be the approximation of the $i$th value function. Then the approximation for the
$(i+1)$th value function $V_{i+1}$ can be obtained as

\[ V_{i+1}(b) = R_G(b, \Psi^G_{i+1}), \]

where values associated with every grid point $b_j^G \in G$ (and included in $\Psi^G_{i+1}$) are:

\[ \varphi_{i+1}(b_j^G) = (H V_i)(b_j^G) = \max_{a \in A} \left\{ \rho(b_j^G, a) + \gamma \sum_{o \in \Theta} P(o|b_j^G, a) V_i(\tau(b_j^G, o, a)) \right\}. \qquad (9) \]

The grid-based update can also be described in terms of a value-function mapping $H_G$:
$V_{i+1} = H_G V_i$. The complexity of such an update is $O(|G||A||S|^2|\Theta| C_{Eval}(R_G, |G|))$ where
$C_{Eval}(R_G, |G|)$ is the computational cost of evaluating the interpolation-extrapolation rule
$R_G$ for $|G|$ grid points. We show later (Section 4.5.3) that in some instances the need to
evaluate the interpolation-extrapolation rule in every step can be eliminated. 

4.5.1 A Family of Convex Rules 

The number of all possible interpolation-extrapolation rules is enormous. We focus on a 
set of convex rules that is a relatively small but very important subset of interpolation- 
extrapolation rules. 

Definition 9 (Convex rule) Let $f$ be some function defined over the space $\mathcal{I}$, $G = \{b_1^G, b_2^G, \ldots, b_k^G\}$
be a set of grid points, and $\Psi^G = \{(b_1^G, f(b_1^G)), (b_2^G, f(b_2^G)), \ldots, (b_k^G, f(b_k^G))\}$ be a set of point-value
pairs. The rule $R_G$ for estimating $f$ using $\Psi^G$ is called convex when for every $b \in \mathcal{I}$,
the value $\hat{f}(b)$ is:

\[ \hat{f}(b) = R_G(b, \Psi^G) = \sum_{j=1}^{|G|} \lambda_j^b f(b_j^G), \]

such that $0 \le \lambda_j^b \le 1$ for every $j = 1, \ldots, |G|$ and $\sum_{j=1}^{|G|} \lambda_j^b = 1$.

20. We note that convex rules used in our work are a special case of averagers introduced by Gordon (1995). 
The difference is minor; the definition of an averager includes a constant (independent of grid points and 
their values) that is added to the convex combination. 
The key property of convex rules is that their corresponding grid-based update $H_G$ is a
contraction in the max norm (Gordon, 1995). Thus, the approximate value iteration based
on $H_G$ converges to the unique fixed-point solution. In addition, $H_G$ based on convex rules
is isotone. 

4.5.2 Examples of Convex Rules 

The family of convex rules includes approaches that are very commonly used in practice, 
like nearest neighbor, kernel regression, linear point interpolations and many others. 

Take, for example, the nearest-neighbor approach. The function for a belief point b is 
estimated using the value at the grid point closest to it in terms of some distance metric M 
defined over the belief space. Then, for any point $b$, there is exactly one nonzero parameter
$\lambda_j^b = 1$ such that $\| b - b_j^G \|_M \le \| b - b_i^G \|_M$ holds for all $i = 1, 2, \ldots, k$. All other $\lambda$s are
zero. Assuming the Euclidean distance metric, the nearest-neighbor approach leads to a
piecewise constant approximation, in which regions with equal values correspond to regions 
with a common nearest grid point. 

The nearest neighbor estimates the function value by taking into account only one
grid point and its value. Kernel regression expands upon this by using more grid points. It 
adds up and weights their contributions (values) according to their distance from the target 
point. For example, assuming Gaussian kernels, the weight for a grid point $b_j^G$ is

\[ \lambda_j^b = \beta \exp\left( - \frac{\| b - b_j^G \|_M^2}{2 \sigma^2} \right), \]

where $\beta$ is a normalizing constant ensuring that $\sum_{j=1}^{|G|} \lambda_j^b = 1$ and $\sigma$ is a parameter that
flattens or narrows weight functions. For the Euclidean metric, the above kernel-regression
rule leads to a smooth approximation of the function. 
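Both rules reduce to computing a weight vector $\lambda^b$ over the grid and taking a weighted sum of grid values. The following is a small sketch under stated assumptions: the Euclidean metric, the function names and the example grid are illustrative choices, not taken from the paper.

```python
import math


def dist(b, g):
    """Euclidean distance between two belief points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(b, g)))


def nearest_neighbor_weights(b, grid):
    """All weight on the closest grid point (ties go to the first one)."""
    j = min(range(len(grid)), key=lambda k: dist(b, grid[k]))
    return [1.0 if k == j else 0.0 for k in range(len(grid))]


def gaussian_kernel_weights(b, grid, sigma):
    """Normalized Gaussian weights; dividing by the total plays the role of beta."""
    raw = [math.exp(-dist(b, g) ** 2 / (2.0 * sigma ** 2)) for g in grid]
    total = sum(raw)
    return [r / total for r in raw]


def estimate(weights, values):
    """Convex-rule estimate: weighted sum of the grid values."""
    return sum(w * v for w, v in zip(weights, values))


grid = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [10.0, 8.0, 6.0]          # made-up values at the grid points
b = [0.9, 0.1]

nn = nearest_neighbor_weights(b, grid)
kr = gaussian_kernel_weights(b, grid, sigma=0.3)
print(estimate(nn, values), estimate(kr, values))
```

Both weight vectors are nonnegative and sum to one, so both satisfy the constraints of Definition 9.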

Linear point interpolations are a subclass of convex rules that in addition to constraints
in Definition 9 satisfy

\[ b = \sum_{j=1}^{|G|} \lambda_j^b b_j^G. \]

That is, a belief point $b$ is a convex combination of grid points and the $\lambda$s are the
corresponding coefficients. Because the optimal value function for the POMDP is convex, the
new constraint is sufficient to prove the upper-bound property of the approximation. In 
general, there can be many different linear point-interpolations for a given grid. A challeng- 
ing problem here is to find the rule with the best approximation. We discuss these issues 
in Section 4.5.7. 

4.5.3 Conversion to a Grid-Based MDP 

Assume that we would like to find the approximation of the value function using our grid- 
based convex rule and grid-based update (Equation 9). We can view this process also as 
a process of finding a sequence of values $\varphi_1(b_j^G), \varphi_2(b_j^G), \ldots, \varphi_i(b_j^G), \ldots$ for all grid-points
$b_j^G \in G$. We show that in some instances the sequence of values can be computed without
applying an interpolation-extrapolation rule in every step. In such cases, the problem can
be converted into a fully observable MDP with states corresponding to grid-points G. We
call this MDP a grid-based MDP. 

Theorem 11 Let $G$ be a finite set of grid points and $R_G$ be a convex rule such that parameters
$\lambda_j^b$ are fixed. Then the values of $\varphi(b_j^G)$ for all $b_j^G \in G$ can be found by solving a fully
observable MDP with $|G|$ states and the same discount factor $\gamma$.

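Proof For any grid point $b_j^G$ we can write:

\[
\begin{aligned}
\varphi_{i+1}(b_j^G) &= \max_{a \in A} \left\{ \rho(b_j^G, a) + \gamma \sum_{o \in \Theta} P(o|b_j^G, a) V_i(\tau(b_j^G, o, a)) \right\} \\
&= \max_{a \in A} \left\{ \rho(b_j^G, a) + \gamma \sum_{o \in \Theta} P(o|b_j^G, a) \left[ \sum_{k=1}^{|G|} \lambda_{j,k}^{o,a} \varphi_i(b_k^G) \right] \right\} \\
&= \max_{a \in A} \left\{ \rho(b_j^G, a) + \gamma \sum_{k=1}^{|G|} \varphi_i(b_k^G) \left[ \sum_{o \in \Theta} P(o|b_j^G, a) \lambda_{j,k}^{o,a} \right] \right\}.
\end{aligned}
\]

Now denoting $\left[ \sum_{o \in \Theta} P(o|b_j^G, a) \lambda_{j,k}^{o,a} \right]$ as $P(b_k^G|b_j^G, a)$, we can construct a fully observable
MDP problem with states corresponding to grid points $G$ and the same discount factor $\gamma$.
The update step equals:

\[ \varphi_{i+1}(b_j^G) = \max_{a \in A} \left\{ \rho(b_j^G, a) + \gamma \sum_{k=1}^{|G|} P(b_k^G|b_j^G, a) \varphi_i(b_k^G) \right\}. \]

The prerequisite $0 \le \lambda_j \le 1$ for every $j = 1, \ldots, |G|$ and $\sum_{j=1}^{|G|} \lambda_j = 1$ guarantees that
$P(b_k^G|b_j^G, a)$ can be interpreted as true probabilities. Thus, one can compute values $\varphi(b_j^G)$
by solving the equivalent fully-observable MDP. □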



4.5.4 Solving Grid-Based Approximations 

The idea of converting a grid-based approximation into a grid-based MDP is a basis of 
our simple but very powerful approximation algorithm. Briefly, the key here is to find 
the parameters (transition probabilities and rewards) of a new MDP model and then solve 
it. This process is relatively easy if the parameters $\lambda$ used to interpolate-extrapolate the
value of a non-grid point are fixed (the assumption of Theorem 11). In such a case, we 
can determine parameters of the new MDP efficiently in one step, for any grid set G. The 
nearest neighbor or the kernel regression are examples of rules with this property. Note that 
this leads to polynomial-time algorithms for finding values for all grid points (recall that an 
MDP can be solved efficiently for both finite and discounted, infinite-horizon criteria).
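For a fixed rule such as nearest neighbor, the construction of Theorem 11 amounts to computing, once, the transition probabilities between grid points and then running an ordinary MDP solver. The sketch below illustrates this under several assumptions: the model layout, the helper names, the toy numbers and the use of plain value iteration (rather than, say, linear programming) are choices made for the example.

```python
def belief_update(b, a, o, T, O):
    """Return (P(o | b, a), tau(b, a, o)); the belief is None when P(o | b, a) = 0."""
    n = len(b)
    unnorm = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n)) for s2 in range(n)]
    p_o = sum(unnorm)
    return (p_o, [u / p_o for u in unnorm]) if p_o > 0.0 else (0.0, None)


def nearest(b, grid):
    """Index of the grid point closest to b (squared Euclidean distance)."""
    return min(range(len(grid)),
               key=lambda k: sum((x - y) ** 2 for x, y in zip(b, grid[k])))


def build_grid_mdp(grid, R, T, O):
    """Rewards rho(b_j, a) and transitions P(b_k | b_j, a) for nearest neighbor."""
    n_actions, n_obs = len(T), len(O[0][0])
    rewards = [[sum(b[s] * R[s][a] for s in range(len(b))) for a in range(n_actions)]
               for b in grid]
    trans = [[[0.0] * len(grid) for _ in range(n_actions)] for _ in grid]
    for j, b in enumerate(grid):
        for a in range(n_actions):
            for o in range(n_obs):
                p_o, nb = belief_update(b, a, o, T, O)
                if p_o > 0.0:
                    # Nearest neighbor puts weight 1 on a single grid point.
                    trans[j][a][nearest(nb, grid)] += p_o
    return rewards, trans


def solve_mdp(rewards, trans, gamma, iters=500):
    """Plain value iteration over the grid points."""
    phi = [0.0] * len(rewards)
    for _ in range(iters):
        phi = [max(rewards[j][a] + gamma * sum(p * v for p, v in zip(trans[j][a], phi))
                   for a in range(len(rewards[j])))
               for j in range(len(rewards))]
    return phi


# Toy model (made-up numbers) and a small grid over a two-state belief space.
R = [[1.0, 0.0], [0.0, 1.0]]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.5, 0.5]]]
grid = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]

rewards, trans = build_grid_mdp(grid, R, T, O)
phi = solve_mdp(rewards, trans, gamma=0.9)
print(phi)
```

The interpolation rule is evaluated only while building `trans`; the value iteration itself never touches the belief space again.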

The problem in solving grid-based approximation arises only when the parameters $\lambda$
used in the interpolation-extrapolation are not fixed and are subject to the optimization
itself. This happens, for example, when there are multiple ways of interpolating a value

21. We note that a similar result has been also proved independently by Gordon (1995). 
at some point of the belief space and we would like to find the best interpolation (leading
to the best values) for all grid points in G. In such a case, the corresponding "optimal" 
grid-based MDP cannot be found in a single step and iterative approximation, solving a 
sequence of grid-based MDPs, is usually needed. The worst-case complexity of this problem 
remains an open question. 

4.5.5 Constructing Grids 

An issue we have not touched on so far is the selection of grids. There are multiple ways to 
select grids. We divide them into two classes - regular and non-regular grids. 

Regular grids (Lovejoy, 1991a) partition the belief space evenly into equal-size regions. 
The main advantage of regular grids is the simplicity with which we can locate grid points 
in the neighborhood of any belief point. The disadvantage of regular grids is that they 
are restricted to a specific number of points, and any increase in grid resolution is paid for 
in an exponential increase in the grid size. For example, a sequence of regular grids for a 
20-dimensional belief space (corresponding to a POMDP with 20 states) consists of 20, 210,
1540, 8855, 42504, • • • grid points. This prevents one from using the method with higher 
grid resolutions for problems with larger state spaces. 
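The sequence quoted above follows from the grid-size formula attributed to Lovejoy (1991a) in the footnote, (M + |S| - 1)! / (M! (|S| - 1)!), which is a binomial coefficient. A short check, with the function name chosen here only for illustration:

```python
from math import comb


def regular_grid_size(n_states, resolution):
    """Number of points in a regular grid with refinement parameter M = resolution."""
    return comb(resolution + n_states - 1, n_states - 1)


sizes = [regular_grid_size(20, m) for m in range(1, 6)]
print(sizes)
```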

Non-regular grids are unrestricted and thus provide for more flexibility when grid reso- 
lution must be increased adaptively. On the other hand, due to irregularities, methods for 
locating grid points adjacent to an arbitrary belief point are usually more complex when 
compared to regular grids. 

4.5.6 Linear Point Interpolation 

The fact that the optimal value function V* is convex for a belief-state MDP can be used
to show that the approximation based on linear point interpolation always upper-bounds 
the exact solution (Lovejoy, 1991a, 1993). Neither kernel regression nor nearest neighbor 
can guarantee us any bound. 

Theorem 12 (Upper bound property of a grid-based point interpolation update). Let $V_i$ be
a convex value function. Then $H V_i \le H_G V_i$.

The upper-bound property of the $H_G$ update for convex value functions follows directly
from Jensen's inequality. The convergence to an upper-bound follows from Theorem 6. 

Note that the point-interpolation update imposes an additional constraint on the choice 
of grid points. In particular, it is easy to see that any valid grid must also include ex- 
treme points of the belief simplex (extreme points correspond to (1, 0, 0, • • •), (0, 1, 0, • • •), 

22. Regular grids used by Lovejoy (1991a) are based on Freudenthal triangulation (Eaves, 1984). Essentially,
this is the same idea as used to partition evenly the n-dimensional subspace of $\mathbb{R}^n$. In fact, an
affine transform allows us to map isomorphically grid points in the belief space to grid points in the
n-dimensional space (Lovejoy, 1991a).

23. The number of points in the regular grid sequence is given by (Lovejoy, 1991a):

\[ |G| = \frac{(M + |S| - 1)!}{M! \, (|S| - 1)!}, \]

where $M = 1, 2, \ldots$ is a grid refinement parameter.
etc.). Without extreme points one would be unable to cover the whole belief space via
interpolation. Nearest neighbor and kernel regression impose no restrictions on the grid. 

4.5.7 Finding the Best Interpolation

In general, there are multiple ways to interpolate a point of a belief space. Our objective
is to find the best interpolation, that is, the one that leads to the tightest upper bound of 
the optimal value function. 

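Let $b$ be a belief point and $\{(b_j, f(b_j)) \,|\, b_j \in G\}$ a set of grid-value pairs. Then the best
interpolation for point $b$ is:

\[ \hat{f}(b) = \min_{\lambda} \sum_{j=1}^{|G|} \lambda_j f(b_j) \]

subject to $0 \le \lambda_j \le 1$ for all $j = 1, \ldots, |G|$, $\sum_{j=1}^{|G|} \lambda_j = 1$, and $b = \sum_{j=1}^{|G|} \lambda_j b_j$.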

This is a linear optimization problem. Although it can be solved in polynomial time 
(using linear programming techniques), the computational cost of doing this is still relatively 
large, especially considering the fact that the optimization must be repeated many times. 
To alleviate this problem we seek more efficient ways of finding the interpolation, sacrificing 
the optimality. 

One way to find a (suboptimal) interpolation quickly is to apply regular grids proposed 
by Lovejoy (1991a). In this case the value at a belief point is approximated using the 
convex combination of grid points closest to it. The approximation leads to piecewise linear 
and convex value functions. As all interpolations are fixed here, the problem of finding 
the approximation can be converted into an equivalent grid-based MDP and solved as a 
finite-state MDP. However, as pointed out in Section 4.5.5, the regular grids must use a
specific number of grid points and any increase in the resolution of a grid is paid for by an
exponential increase in the grid size. This feature makes the method less attractive when
we have a problem with a large state space and we need to achieve high grid resolution.^24

In the present work we focus on non-regular (or arbitrary) grids. We propose an inter- 
polation approach that searches a limited space of interpolations and is guaranteed to run 
in time linear in the size of the grid. The idea of the approach is to interpolate a point 
$b$ of a belief space of dimension $|S|$ with a set of grid points that consists of an arbitrary
grid point $b' \in G$ and $|S| - 1$ extreme points of the belief simplex. The coefficients of this
interpolation can be found efficiently and we search for the best such interpolation. Let
$b' \in G$ be a grid point defining one such interpolation. Then the value at point $b$ satisfies

\[ V_i(b) = \min_{b' \in G} V_i^{b'}(b), \]

where $V_i^{b'}$ is the value of the interpolation for the grid point $b'$. Figure 17 illustrates the
resulting approximation. The function is characterized by its "sawtooth" shape, which is 
influenced by the choice of the interpolating set. 
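One way to realize such an interpolation, sketched here under assumptions, is to give the grid point $b'$ the largest coefficient that keeps all remaining coefficients nonnegative and to place the remainder on the extreme points; at least one extreme point then receives zero weight. The function name, the separate list of extreme-point values and the example numbers are illustrative and not taken from the paper.

```python
def sawtooth_value(b, grid, grid_values, corner_values):
    """Best interpolation of b using one grid point plus extreme points.

    corner_values[s] is the value at the extreme point with all mass on state s.
    Runs in time linear in the number of grid points.
    """
    # Fallback: interpolate with extreme points only.
    best = sum(bs * cv for bs, cv in zip(b, corner_values))
    for g, vg in zip(grid, grid_values):
        # Largest coefficient on g such that b - lam * g stays nonnegative.
        lam = min(bs / gs for bs, gs in zip(b, g) if gs > 0.0)
        val = lam * vg + sum((bs - lam * gs) * cv
                             for bs, gs, cv in zip(b, g, corner_values))
        best = min(best, val)
    return best


# Two-state example with made-up values: one interior grid point.
corner_values = [10.0, 10.0]
grid = [[0.5, 0.5]]
grid_values = [6.0]

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, sawtooth_value([p, 1.0 - p], grid, grid_values, corner_values))
```

In this example the estimate drops linearly from the extreme points toward the interior grid point, which produces the tooth-like profile described in the text.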

To find the best value-function solution or its close approximation we can apply a value 
iteration procedure in which we search for the best interpolation after every update step. 

24. One solution to this problem may be to use adaptive regular grids in which grid resolution is increased 
only in some parts of the belief space. We leave this idea for future work. 
Figure 17: Value-function approximation based on the linear-time interpolation approach
(a two-dimensional case). Interpolating sets are restricted to a single internal 
point of the belief space. 



The drawback of this approach is that interpolations may remain unchanged for many 
update steps, thus slowing down the solution process. An alternative approach is to solve 
a sequence of grid-based MDPs instead. In particular, at every stage we find the best 
(minimum value) interpolations for all belief points reachable from grid points in one step, fix 
coefficients of these interpolations ($\lambda$s), construct a grid-based MDP and solve it (exactly or
approximately). This process is repeated until no further improvement (or no improvement 
larger than some threshold) is seen in values at different grid points. 

4.5.8 Improving Grids Adaptively 

The quality of an approximation (bound) depends strongly on the points used in the grid. 
Our objective is to provide a good approximation with the smallest possible set of grid 
points. However, this task is impossible to achieve, since it cannot be known in advance 
(before solving) what belief points to pick. A way to address this problem is to build grids 
incrementally, starting from a small set of grid points and adding others adaptively, but 
only in places with a greater chance of improvement. The key part of this approach is a 
heuristic for choosing grid points to be added next. 

One heuristic method we have developed attempts to maximize improvements in bound 
values via stochastic simulations. The method builds on the fact that every interpolation 
grid must also include extreme points (otherwise we cannot cover the entire belief space). 
As extreme points and their values affect the other grid points, we try to improve their 
values in the first place. In general, a value at any grid point $b$ improves more the more
precise values are used for its successor belief points, that is, belief states that correspond
to $\tau(b, a^*, o)$ for a choice of observation $o$, where $a^*$ is the current optimal action choice for $b$.
Incorporating such points into the grid then makes a larger improvement in the value at 
the initial grid point b more likely. Assuming that our initial point is an extreme point, we 
have a heuristic that tends to improve a value for that point. Naturally, one can proceed 
further with this selection by incorporating the successor points for the first-level successors 
into the grid as well, and so forth.

generate new grid points ($G$, $\hat{V}$)

    set $G_{new} = \{\}$
    for all extreme points $b$ do
        repeat until $b \notin G \cup G_{new}$
            set $a^* = \arg\max_a \left\{ \rho(b, a) + \gamma \sum_{o \in \Theta} P(o|b, a) \hat{V}(\tau(b, a, o)) \right\}$
            select observation $o$ according to $P(o|b, a^*)$
            update $b = \tau(b, a^*, o)$
        add $b$ into $G_{new}$
    return $G_{new}$



Figure 18: Procedure for generating additional grid points based on our bound improvement heuristic.

Figure 19: Improvement in the upper bound quality for grid-based point-interpolations
based on the adaptive-grid method. The method is compared to the randomly
refined grid and the regular grid with 210 points. Other upper-bound approximations (the MDP, QMDP and fast informed bound methods) are included for
comparison.



To capture this idea, we generate new grid points via simulation, starting from one 
of the extremes of the belief simplex and continuing until a belief point not currently in 
the grid is reached. An algorithm that implements the bound improvement heuristic and 
expands the current grid $G$ with a set of $|S|$ new grid points while relying on the current
value-function approximation is shown in Figure 18. 
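
As a rough illustration of the simulation-based heuristic, the following Python sketch mirrors the procedure of Figure 18. Several details are assumptions that do not come from the text: the model is stored as a joint array `P[a, s, s2, o]` holding $P(s', o|s, a)$, rewards are stored as `R[s, a]`, the current approximation is passed as a callable `V_hat`, and a `max_steps` cap is added as a safeguard against simulations that never leave the grid. All function names are hypothetical.

```python
import numpy as np

def successors(b, a, P):
    """Return (p_o, next_beliefs) for belief b and action a.

    Assumed layout: P[a, s, s2, o] = P(s2, o | s, a).
    p_o[o] = P(o | b, a); next_beliefs[o] = tau(b, a, o), or None if p_o[o] == 0.
    """
    joint = np.einsum("s,sto->to", b, P[a])  # joint[s2, o] = sum_s P(s2, o | s, a) b(s)
    p_o = joint.sum(axis=0)
    nxt = [joint[:, o] / p_o[o] if p_o[o] > 0 else None for o in range(len(p_o))]
    return p_o, nxt

def in_grid(b, points, tol=1e-9):
    """True if belief b coincides (up to tolerance) with some point in `points`."""
    return any(np.allclose(b, g, atol=tol) for g in points)

def generate_new_grid_points(G, V_hat, P, R, gamma, rng, max_steps=1000):
    """Simulate from every extreme point until a belief outside the grid is reached."""
    n_actions, n_states = P.shape[0], P.shape[1]
    G_new = []
    for s in range(n_states):
        b = np.eye(n_states)[s]              # extreme point of the belief simplex
        steps = 0
        while in_grid(b, list(G) + G_new) and steps < max_steps:
            # Greedy action with respect to the current approximation V_hat.
            q = np.empty(n_actions)
            for a in range(n_actions):
                p_o, nxt = successors(b, a, P)
                q[a] = b @ R[:, a] + gamma * sum(
                    p * V_hat(nb) for p, nb in zip(p_o, nxt) if nb is not None)
            a_star = int(np.argmax(q))
            # Sample an observation and move to the successor belief.
            p_o, nxt = successors(b, a_star, P)
            o = rng.choice(len(p_o), p=p_o / p_o.sum())
            b = nxt[o]
            steps += 1
        if not in_grid(b, list(G) + G_new):  # skipped only if the cap was hit
            G_new.append(b)
    return G_new
```

Each call returns at most one new point per extreme point, which matches the $|S|$ new grid points per expansion mentioned above.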

Figure 19 illustrates the performance (bound quality) of our adaptive grid method on
the Maze20 problem. Here we use the combination of adaptive grids with our linear-time
interpolation approach. The method gradually expands the grid in 40-point increments up to
400 grid points. Figure 19 also shows the performance of the random-grid method in which
new points of the grid are selected uniformly at random (results for 40-point increments
are shown). In addition, the figure gives results for the regular grid interpolation (based
on Lovejoy (1991a)) with 210 belief points and other upper-bound methods: the MDP, the
QMDP and the fast informed bound approximations.

Figure 20: Running times of grid-based point-interpolation methods. Methods tested include the adaptive grid, the random grid, and the regular grid with 210 grid
points. Running times for the adaptive grid are cumulative, reflecting the dependencies of higher grid resolutions on the lower-level resolutions. The running
time results for the MDP, QMDP, and fast informed bound approximations are
shown for comparison.

We see a dramatic improvement in the quality of the bound for the adaptive method. 
In contrast to this, the uniformly sampled grid (random-grid approach) hardly changes the 
bound. There are two reasons for this: (1) uniformly sampled grid points are more likely to
be concentrated in the center of the belief simplex; (2) the transition matrix for the Maze20
problem is relatively sparse, so the belief points one obtains from the extreme points in one
step are on the boundary of the simplex. Since grid points in the center of the simplex
are never used to interpolate belief states reachable from extremes in one step, they cannot
improve values at extremes and the bound does not change.

One drawback of the adaptive method is its running time (for every grid size we need
to solve a sequence of grid-based MDPs). Figure 20 compares running times of different
methods on the Maze20 problem. As grid-expansion of the adaptive method depends on 
the value function obtained for previous steps, we plot its cumulative running times. We 
see a relatively large increase in running time, especially for larger grid sizes, reflecting 
the trade-off between the bound quality and running time. However, we note that the 
adaptive-grid method performs quite well in the initial few steps, and with only 80 grid 
points outperforms the regular grid (with 210 points) in bound quality. 

Finally, we note that other heuristic approaches to constructing adaptive grids for point
interpolation are possible. For example, a different approach that refines the grid by examining differences in values at current grid points has recently been proposed by Brafman
(1997).

Figure 21: Control performance of lookahead controllers based on grid-based point interpolation and nearest neighbor methods and varying grid sizes. The results are
compared to the MDP, the QMDP and the fast informed bound controllers.

4.5.9 Control 

Value functions obtained for different grid-based methods define a variety of controllers. Fig- 
ure 21 compares the performances of lookahead controllers based on the point-interpolation 
and nearest-neighbor methods. We run two versions of both approaches, one with the adap- 
tive grid, the other with the random grid, and we show results obtained for 40, 200 and 400 
grid points. In addition, we compare their performances to the interpolation with regular 
grids (with 210 grid points), the MDP, the QMDP and the fast informed bound approaches. 

Overall, the performance of the interpolation-extrapolation techniques we tested on 
the Maze20 problem was a bit disappointing. In particular, better scores were achieved 
by the simpler QMDP and fast informed bound methods. We see that, although the heuristics
improved the bound quality of the approximations, they did not lead to a similar improvement
over the QMDP and the fast informed bound methods in terms of control. This result
shows that a bad bound (in terms of absolute values) does not always imply bad control 
performance. The main reason for this is that the control performance is influenced mostly 
by relative rather than absolute value-function values (or, in other words, by the shape 
of the function). All interpolation-extrapolation techniques we use (except regular grid 
interpolation) approximate the value function with functions that are not piecewise linear 
and convex; the interpolations are based on the linear-time interpolation technique with a 
sawtooth-shaped function, and the nearest-neighbor leads to a piecewise-constant function. 
This does not allow them to match the shape of the optimal function correctly. The other 
factor that affects the performance is a large sensitivity of methods to the selection of grid 
points, as documented, for example, by the comparison of heuristic and random grids.

In the above tests we focused on lookahead controllers only. However, an alternative way
to define a controller for grid-based interpolation-extrapolation methods is to use Q-function 
approximations instead of value functions, and either direct or lookahead designs. Q- 
function approximations can be found by solving the same grid-based MDP, and by keeping 
values (functions) for different actions separate at the end. 

4.6 Approximations of Value Functions Using Curve Fitting (Least-Squares 
Fit) 

An alternative way to approximate a function over a continuous space is to use curve-fitting 
techniques. This approach relies on a predefined parametric model of the value function 
and a set of values associated with a finite set of (grid) belief points G. The approach 
is similar to interpolation-extrapolation techniques in that it relies on a set of belief-value 
pairs. The difference is that the curve fitting, instead of remembering all belief-value pairs, 
tries to summarize them in terms of a given parametric function model. The strategy seeks 
the best possible match between model parameters and observed point values. The best 
match can be defined using various criteria, most often the least-squares fit criterion, where 
the objective is to minimize 

$$\frac{1}{2} \sum_{j} \left( y_j - f(b_j) \right)^2.$$

Here $f$ is the parametric model, and $b_j$ and $y_j$ correspond to the belief point and its associated value. The index $j$ ranges
over all points of the sample set $G$.

4.6.1 Combining Dynamic Programming and Least-Squares Fit 

The least-squares approximation of a function can be used to construct a dynamic-programming
algorithm with an update step: $\hat{V}_{i+1} = H_{LSF}\hat{V}_i$. The approach has two steps. First, we
obtain new values for a set of sample points $G$:

$$\varphi_{i+1}(b) = (H\hat{V}_i)(b) = \max_{a \in A} \left\{ \sum_{s \in S} \rho(s, a) b(s) + \gamma \sum_{o \in \Theta} \sum_{s \in S} P(o|s, a) b(s) \hat{V}_i(\tau(b, o, a)) \right\}.$$

Second, we fit the parameters of the value-function model $\hat{V}_{i+1}$ using the new sample-value pairs
and the square-error cost function. The complexity of the update is $O(|G||A||S|^2|\Theta| C_{Eval}(\hat{V}_i) + C_{Fit}(\hat{V}_{i+1}, |G|))$ time, where $C_{Eval}(\hat{V}_i)$ is the computational cost of evaluating $\hat{V}_i$ and
$C_{Fit}(\hat{V}_{i+1}, |G|)$ is the cost of fitting parameters of $\hat{V}_{i+1}$ to $|G|$ belief-value pairs.

The advantage of the approximation based on the least-squares fit is that it requires us 
to compute updates only for the finite set of belief states. The drawback of the approach 
is that, when combined with the value-iteration method, it can lead to instability and/or
divergence. This has been shown for MDPs by several researchers (Bertsekas, 1994; Boyan 
& Moore, 1995; Baird, 1995; Tsitsiklis & Roy, 1996). 

25. This is similar to the QMDP method, which allows both lookahead and greedy designs. In fact, QMDP 
can be viewed as a special case of the grid-based method with Q-function approximations, where grid 
points correspond to extremes of the belief simplex.

4.6.2 On-line Version of the Least-Squares Fit

The problem of finding a set of parameters with the best fit can be solved by any available 
optimization procedure. This includes the on-line (or instance-based) version of the gradient 
descent method, which corresponds to the well-known delta rule (Rumelhart, Hinton, & 
Williams, 1986). 

Let $f$ denote a parametric value function over the belief space with adjustable weights
$w = \{w_1, w_2, \ldots, w_k\}$. Then the on-line update for a weight $w_i$ is computed as:

$$w_i \leftarrow w_i - \alpha_i \left( f(b_j) - y_j \right) \left. \frac{\partial f}{\partial w_i} \right|_{b_j},$$

where $\alpha_i$ is a learning constant, and $b_j$ and $y_j$ are the last-seen point and its value. Note
that the gradient descent method requires the function to be differentiable with regard to
adjustable weights.
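
To make the on-line update concrete, here is a minimal Python sketch of the delta rule for the simplest differentiable model, a linear function $f(b) = w \cdot b$, for which the partial derivative of $f$ with respect to $w_i$ at $b_j$ is just $b_j(i)$. The target weights, the constant learning rate and the Dirichlet sampling of beliefs are illustrative assumptions, not settings taken from the experiments.

```python
import numpy as np

def delta_rule_step(w, b_j, y_j, alpha):
    """One on-line update for the linear model f(b) = w . b.

    For this model df/dw_i evaluated at b_j equals b_j[i], so the update
    w_i <- w_i - alpha * (f(b_j) - y_j) * df/dw_i becomes a vector operation.
    """
    error = w @ b_j - y_j
    return w - alpha * error * b_j

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])   # hypothetical target function (assumption)
w = np.zeros(3)
for _ in range(5000):
    b = rng.dirichlet(np.ones(3))     # belief sampled uniformly from the simplex
    y = true_w @ b                    # noise-free value observed at b
    w = delta_rule_step(w, b, y, alpha=0.5)
```

With noise-free linear targets the weights settle on the target function. For a model that is nonlinear in its weights, only the derivative term changes.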

To solve the discounted infinite-horizon problem, the stochastic (on-line) version of a 
least-squares fit can be combined with either parallel (synchronous) or incremental (Gauss- 
Seidel) point updates. In the first case, the value function from the previous step is fixed 
and a new value function is computed from scratch using a set of belief point samples and 
their values computed through one-step expansion. Once the parameters are stabilized (by 
attenuating learning rates), the newly acquired function is fixed, and the process proceeds 
with another iteration. In the incremental version, a single value-function model is at the 
same time updated and used to compute new values at sampled points. Littman et al. (1995) 
and Parr and Russell (1995) implement this approach using asynchronous reinforcement 
learning backups in which sample points to be updated next are obtained via stochastic 
simulation. We stress that all versions are subject to the threat of instability and divergence, 
as remarked above. 

4.6.3 Parametric Function Models 

To apply the least-squares approach we must first select an appropriate value function 
model. Examples of simple convex functions are linear or quadratic functions, but more 
complex models are possible as well. 

One interesting and relatively simple approach is based on the least-squares approximation of linear action-value functions (Q-functions) (Littman et al., 1995). Here the
value function $\hat{V}_{i+1}$ is approximated as a piecewise linear and convex combination of $\hat{Q}_{i+1}$
functions:

$$\hat{V}_{i+1}(b) = \max_{a \in A} \hat{Q}_{i+1}(b, a),$$

where $\hat{Q}_{i+1}(b, a)$ is the least-squares fit of a linear function for a set of sample points $G$.
Values at points in $G$ are obtained as

$$\varphi^a_{i+1}(b) = \rho(b, a) + \gamma \sum_{o \in \Theta} P(o|b, a) \hat{V}_i(\tau(b, o, a)).$$

The method leads to an approximation with $|A|$ linear functions, and the coefficients of these
functions can be found efficiently by solving a set of linear equations. Recall that the other two
approximations (the QMDP and the fast informed bound approximations) also work with
$|A|$ linear functions. The main differences between the methods are that the QMDP and
fast informed bound methods update linear functions directly, and they guarantee upper
bounds and unique convergence.
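
One synchronous update of the linear Q-function model can be sketched in Python as follows. This is an illustration under assumptions rather than the implementation used in the experiments: the model is stored as a joint array `P[a, s, s2, o]` holding $P(s', o|s, a)$, rewards as `R[s, a]`, the current Q-functions as rows of `Q`, and the fit is delegated to `numpy.linalg.lstsq` instead of an explicit solution of the normal equations. The function name is hypothetical.

```python
import numpy as np

def linear_q_update(Q, G, P, R, gamma):
    """Fit one linear function per action to one-step backed-up values at G.

    Q: (|A|, |S|) current linear Q-functions; G: (|G|, |S|) sample beliefs.
    Assumed layouts: P[a, s, s2, o] = P(s2, o | s, a), R[s, a] = rho(s, a).
    """
    B = np.asarray(G, dtype=float)
    Q_new = np.empty_like(Q, dtype=float)
    for a in range(P.shape[0]):
        targets = np.empty(len(B))
        for j, b in enumerate(B):
            joint = np.einsum("s,sto->to", b, P[a])  # unnormalized successors, one column per o
            p_o = joint.sum(axis=0)                  # P(o | b, a)
            future = 0.0
            for o in np.flatnonzero(p_o > 0):
                b_next = joint[:, o] / p_o[o]        # tau(b, o, a)
                future += p_o[o] * np.max(Q @ b_next)  # V_i(b') = max_a Q_i(b', a)
            targets[j] = b @ R[:, a] + gamma * future
        # Least-squares fit of a linear function q_a with q_a . b ~ targets.
        Q_new[a] = np.linalg.lstsq(B, targets, rcond=None)[0]
    return Q_new
```

Because each action has its own linear function, a sample value for action $a$ affects only the coefficients of that action's function.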

A more sophisticated parametric model of a convex function is the softmax model (Parr
& Russell, 1995):

$$\hat{V}(b) = \left[ \sum_{\alpha \in \Gamma} \left[ \sum_{s \in S} \alpha(s) b(s) \right]^{k} \right]^{\frac{1}{k}},$$

where $\Gamma$ is the set of linear functions $\alpha$ with adaptive parameters to fit and $k$ is a "temperature" parameter that provides a better fit to the underlying piecewise linear convex function
for larger values. The function represents a soft approximation of a piecewise linear convex
function, with the parameter $k$ smoothing the approximation.
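
The effect of the temperature parameter can be checked numerically with a small Python sketch. It assumes the softmax model has the form $[\sum_{\alpha} (\alpha \cdot b)^k]^{1/k}$ and that every $\alpha \cdot b$ is strictly positive. The rescaling by the largest term is an added numerical safeguard, and the function name is hypothetical.

```python
import numpy as np

def softmax_value(b, Gamma, k):
    """Evaluate [sum_alpha (alpha . b)^k]^(1/k) for the rows alpha of Gamma.

    Assumes alpha . b > 0 for every alpha. Dividing by the largest term
    before exponentiation avoids overflow for large k.
    """
    vals = Gamma @ b
    if np.any(vals <= 0):
        raise ValueError("this sketch assumes alpha . b > 0 for every alpha")
    m = vals.max()
    return m * np.sum((vals / m) ** k) ** (1.0 / k)

Gamma = np.array([[1.0, 2.0],
                  [3.0, 1.0]])      # two illustrative linear functions
b = np.array([0.5, 0.5])            # alpha . b values are 1.5 and 2.0
hard_max = np.max(Gamma @ b)        # piecewise linear convex value: 2.0
soft_k1 = softmax_value(b, Gamma, k=1)
soft_k200 = softmax_value(b, Gamma, k=200)
```

For $k = 1$ the model simply sums the linear functions, and as $k$ grows the value approaches the maximum from above, which is the sense in which larger values give a better fit.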



4.6.4 Control 

We tested the control performance of the least-squares approach on the linear Q-function 
model (Littman et al., 1995) and the softmax model (Parr & Russell, 1995). For the softmax
model we varied the number of linear functions, trying cases with 10 and 15 linear functions 
respectively. In the first set of experiments we used parallel (synchronous) updates and 
samples at a fixed set of 100 belief points. We applied stochastic gradient descent techniques 
to find the best fit in both cases. We tested the control performance for value-function 
approximations obtained after 10, 20 and 30 updates, starting from the QMDP solution. In 
the second set of experiments, we applied the incremental stochastic update scheme with 
Gauss-Seidel-style updates. The results for this method were acquired after every grid point 
was updated 150 times, with learning rates decreasing linearly in the range between 0.2 and 
0.001. Again we started from the QMDP solution. The results for lookahead controllers are 
summarized in Figure 22, which also shows the control performance of the direct Q-function 
controller and, for comparison, the results for the QMDP method. 

The linear Q-function model performed very well and the results for the lookahead design
were better than the results for the QMDP method. The difference was quite apparent for 
direct approaches. In general, the good performance of the method can be attributed to 
the choice of a function model that let us match the shape of the optimal value function 
reasonably well. In contrast, the softmax models (with 10 and 15 linear functions) did not 
perform as expected. This is probably because in the softmax model all linear functions are 
updated for every sample point. This leads to situations in which multiple linear functions 
try to track a belief point during its update. Under these circumstances it is hard to capture 
the structure of the optimal value function accurately. The other negative feature is that 
the effects of on-line changes of all linear functions are added in the softmax approximation, 
and thus could bias incremental update schemes. In the ideal case, we would like to identify 
one vector $\alpha$ responsible for a specific belief point and update (modify) only that vector.
The linear Q-function approach avoids this problem by always updating only a single linear
function (corresponding to an action).

Figure 22: Control performance of least-squares fit methods. Models tested include: linear
Q-function model (with both direct and lookahead control) and softmax mod- 
els with 10 and 15 linear functions (lookahead control only). Value functions 
obtained after 10, 20 and 30 synchronous updates and value functions obtained 
through the incremental stochastic update scheme are used to define different 
controllers. For comparison, we also include results for two QMDP controllers. 



4.7 Grid-Based Approximations with Linear Function Updates 

An alternative grid-based approximation method can be constructed by applying Sondik's
approach for computing derivatives (linear functions) to points of the grid (Lovejoy, 1991a,
1993). Let $V_i$ be a piecewise linear convex function described by a set of linear functions $\Gamma_i$.
Then a new linear function for a belief point $b$ and an action $a$ can be computed efficiently
as (Smallwood & Sondik, 1973; Littman, 1996)

$$\alpha^{b,a}_{i+1}(s) = \rho(s, a) + \gamma \sum_{o \in \Theta} \sum_{s' \in S} P(s', o|s, a) \alpha_i^{\iota(b,a,o)}(s'), \qquad (10)$$

where $\iota(b, a, o)$ indexes a linear function $\alpha_i$ in a set of linear functions $\Gamma_i$ (defining $V_i$) that
maximizes the expression

$$\sum_{s' \in S} \left[ \sum_{s \in S} P(s', o|s, a) b(s) \right] \alpha_i(s')$$

for a fixed combination of $b$, $a$, $o$. The optimizing function for $b$ is then acquired by choosing
the vector with the best overall value from all action vectors. That is, assuming $\Gamma^b_{i+1}$ is the
set of all candidate linear functions, the resulting function satisfies

$$\alpha^{b,*}_{i+1} = \arg\max_{\alpha^{b,a}_{i+1} \in \Gamma^b_{i+1}} \sum_{s \in S} \alpha^{b,a}_{i+1}(s) b(s).$$

Figure 23: An incremental version of the grid-based linear function method. The piecewise
linear lower bound is improved by a new linear function computed for a belief
point $b$ using Sondik's method.

A collection of linear functions obtained for a set of belief points can be combined into
a piecewise linear and convex value function. This is the idea behind a number of exact
algorithms (see Section 2.4.2). However, in the exact case, a set of points that cover all
linear functions defining the new value function must be located first, which is a hard task
in itself. In contrast, the approximation method uses an incomplete set of belief points that
are fixed or at least easy to locate, for example via random or heuristic selection. We use
$H_{GL}$ to denote the value-function mapping for the grid approach.
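
The linear-function update for a single belief point can be sketched in Python as below. The sketch assumes a joint model array `P[a, s, s2, o]` holding $P(s', o|s, a)$, rewards `R[s, a]`, and the current set of linear functions stored as rows of `Gamma`. These layouts and the function name are illustrative choices, not notation from the paper.

```python
import numpy as np

def point_backup(b, Gamma, P, R, gamma):
    """Compute one new linear function for belief b (Sondik-style backup).

    Gamma: (n_vectors, |S|) linear functions defining the current V_i.
    Assumed layouts: P[a, s, s2, o] = P(s2, o | s, a), R[s, a] = rho(s, a).
    Returns the best linear function at b and the action attached to it.
    """
    n_actions, n_obs = P.shape[0], P.shape[3]
    best_value, best_alpha, best_action = -np.inf, None, None
    for a in range(n_actions):
        alpha_ba = R[:, a].astype(float)
        for o in range(n_obs):
            M = P[a, :, :, o]                      # M[s, s2] = P(s2, o | s, a)
            weights = b @ M                        # sum_s P(s2, o | s, a) b(s)
            idx = int(np.argmax(Gamma @ weights))  # index of the maximizing linear function
            alpha_ba = alpha_ba + gamma * (M @ Gamma[idx])
        value = float(alpha_ba @ b)
        if value > best_value:                     # keep the best action vector at b
            best_value, best_alpha, best_action = value, alpha_ba, a
    return best_alpha, best_action
```

The returned vector matches the exact one-step update at $b$ itself and cannot exceed it at any other belief, which is why a collection of such vectors computed on an incomplete grid gives a lower bound on the exact update.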

The advantage of the grid-based method is that it leads to more efficient updates. The
time complexity of the update is polynomial and equals $O(|G||A||S|^2|\Theta|)$. It yields a set of
$|G|$ linear functions, compared to $|A||\Gamma_i|^{|\Theta|}$ possible functions for the exact update.

Since the set of grid-points is incomplete, the resulting approximation lower-bounds the 
value function one would obtain by performing the exact update (Lovejoy, 1991a). 

Theorem 13 (Lower-bound property of the grid-based linear function update). Let $V_i$ be a
piecewise linear value function and $G$ a set of grid points used to compute linear function
updates. Then $H_{GL}V_i \le HV_i$.

4.7.1 Incremental Linear-Function Approach 

The drawback of the grid-based linear function method is that $H_{GL}$ is not a contraction
for the discounted infinite-horizon case, and therefore the value iteration method based on 
the mapping may not converge (Lovejoy, 1991a). To remedy this problem, we propose an 
incremental version of the grid-based linear function method. The idea of this refinement is 
to prevent instability by gradually improving the piecewise linear and convex lower bound 
of the value function. 

Figure 24: Bound quality and running times of the standard and incremental version of
the grid-based linear-function method for the fixed 40-point grid. Cumulative
running times (including all previous update cycles) are shown for both methods.
Running times of the QMDP and the fast informed bound methods are included
for comparison.

Assume that $V_i \le V^*$ is a convex piecewise linear lower bound of the optimal value
function defined by a linear function set $\Gamma_i$, and let $\alpha^b_{i+1}$ be a linear function for a point $b$
that is computed from $V_i$ using Sondik's method. Then one can construct a new improved
value function $V_{i+1} \ge V_i$ by simply adding the new linear function into $\Gamma_i$. That is:
$\Gamma_{i+1} = \Gamma_i \cup \{\alpha^b_{i+1}\}$. The idea of the incremental update, illustrated in Figure 23, is similar
to incremental methods used by Cheng (1988) and Lovejoy (1993). The method can be
extended to handle a set of grid points $G$ in a straightforward way. Note also that after
adding one or more new linear functions to $\Gamma_i$, some of the previous linear functions may
become redundant and can be removed from the value function. Techniques for redundancy
checking are the same as are applied in the exact approaches (Monahan, 1982; Eagle, 1984).
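
A sweep of the incremental scheme over a set of grid points might look as follows in Python. The sketch makes three assumptions that are not stated in the text: the linear-function update is supplied as a callable `backup(b, Gamma)`, each new vector is added immediately so that later points in the sweep already see it, and redundancy is checked only by simple pointwise dominance, which is weaker than linear-programming-based tests. All names are hypothetical.

```python
import numpy as np

def prune_pointwise_dominated(Gamma, tol=1e-12):
    """Drop every row that another row matches or beats in all states."""
    keep = []
    for i, alpha in enumerate(Gamma):
        dominated = False
        for j, other in enumerate(Gamma):
            if j == i or not np.all(other >= alpha - tol):
                continue
            # 'other' is at least as good everywhere: drop alpha if it is
            # strictly worse somewhere, or if it duplicates an earlier row.
            if np.any(other > alpha + tol) or j < i:
                dominated = True
                break
        if not dominated:
            keep.append(i)
    return Gamma[keep]

def incremental_update(Gamma, G, backup, tol=1e-12):
    """Add backed-up linear functions for the points in G, then prune.

    backup(b, Gamma) must return one linear function (a length-|S| vector).
    Old vectors are removed only when dominated, so the bound never decreases.
    """
    Gamma = np.asarray(Gamma, dtype=float)
    for b in G:
        alpha = backup(b, Gamma)
        if alpha @ b > np.max(Gamma @ b) + tol:   # add only if the value at b improves
            Gamma = np.vstack([Gamma, alpha])
    return prune_pointwise_dominated(Gamma)
```

Since a pruned vector is never better than the vector that dominates it, the value $\max_{\alpha} \alpha \cdot b$ at any belief is unchanged by pruning, while the size of the set is kept in check.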

The incremental refinement is stable and converges for a fixed set of grid points. The 
price paid for this feature is that the linear function set $\Gamma_i$ can grow in size over the iteration
steps. Although the growth is at most linear in the number of iterations, compared to 
the potentially exponential growth of exact methods, the linear function set describing 
the piecewise linear approximation can become huge. Thus, in practice we usually stop 
incremental updates well before the method converges. The question that remains open is 
the complexity (hardness) of the problem of finding the fixed-point solution for a fixed set 
of grid points G. 

Figure 24 illustrates some of the trade-offs involved in applying incremental updates 
compared to the standard fixed-grid approach on the Maze20 problem. We use the same 
grid of 40 points for both techniques and the same initial value function. Results for 1-10 
update cycles are shown. We see that the incremental method has longer running times 
than the standard method, since the number of linear functions can grow after every update. 
On the other hand, the bound quality of the incremental method improves more rapidly 
and it can never become worse after more update steps. 

4.7.2 Minimum Expected Reward 

The incremental method improves the lower bound of the value function. The value func- 
tion, say Vi, can be used to create a controller (with either the lookahead or direct-action 
choice). In the general case, we cannot say anything about the performance quality of 
such controllers with regard to Vi. However, under certain conditions the performance of 
both controllers is guaranteed never to fall below Vi. The following theorem (proved in the 
Appendix) establishes these conditions. 

Theorem 14 Let $V_i$ be a value function obtained via the incremental linear function method,
starting from $V_0$, which corresponds to some fixed strategy $C_0$. Let $C_{LA,i}$ and $C_{DR,i}$ be two
controllers based on $V_i$: the lookahead controller and the direct action controller, and $V^{C_{LA,i}}$,
$V^{C_{DR,i}}$ be their respective value functions. Then $V_i \le V^{C_{LA,i}}$ and $V_i \le V^{C_{DR,i}}$ hold.

We note that the same property holds for the incremental version of exact value iteration.
That is, both the lookahead and the direct controllers perform no worse than $V_i$ obtained
after $i$ incremental updates from some $V_0$ corresponding to an FSM controller $C_0$.

4.7.3 Selecting Grid Points 

The incremental version of the grid-based linear-function approximation is flexible and 
works for an arbitrary grid. Moreover, the grid need not be fixed and can be changed on
line. Thus, the problem of finding grids reduces to the problem of selecting belief points to 
be updated next. One can apply various strategies to do this. For example, one can use a 
fixed set of grid points and update them repeatedly, or one can select belief points on line 
using various heuristics. 

The incremental linear function method guarantees that the value function is always 
improved (all linear functions from previous steps are kept unless found to be redundant). 
The quality of a new linear function (to be added next) depends strongly on the quality of 
linear functions obtained in previous steps. Therefore, our objective is to select and order 
points with better chances of larger improvement. To do this we have designed two heuristic 
strategies for selecting and ordering belief points. 

The first strategy attempts to optimize updates at extreme points of the belief simplex 
by ordering them heuristically. The idea of the heuristic is based on the fact that states 
with higher expected rewards (e.g. some designated goal states) backpropagate their effects 
(rewards) locally. Therefore, it is desirable that states in the neighborhood of the highest 
reward state be updated first, and the distant ones later. We apply this idea to order 
extreme points of the belief simplex, relying on the current estimate of the value function 
to identify the highest expected reward states and on a POMDP model to determine the 
neighbor states. 

The second strategy is based on the idea of stochastic simulation. The strategy generates 
a sequence of belief points more likely to be reached from some (fixed) initial belief point. 
The points of the sequence are then used in reverse order to generate updates. The intent 
of this heuristic is to "maximize" the improvement of the value function at the initial fixed 
point. To run this heuristic, we need to find an initial belief point or a set of initial belief 
points. To address this problem, we use the first heuristic that allows us to order the 
extreme points of the belief simplex. These points are then used as initial beliefs for the 
simulation part. Thus, we have a two-tier strategy: the top-level strategy orders extremes 
of the belief simplex, and the lower-level strategy applies stochastic simulation to generate 
a sequence of belief states more likely reachable from a specific extreme point. 

We tested the order heuristics and the two-tier heuristics on our Maze20 problem, and
compared them also to two simple point selection strategies: the fixed-grid strategy, in
which a set of 40 grid points was updated repeatedly, and the random-grid strategy, in
which points were always chosen uniformly at random. Figure 25 shows the bound quality
of the methods for 10 update cycles (each cycle consists of 40 grid point updates) on the
Maze20 problem. We see that the differences in the quality of value-function approximations
for different strategies (even the very simple ones) are relatively small. We note that we
observed similar results also for other problems, not just Maze20.

26. There is no restriction on the grid points that must be included in the grid, such as was required for
example in the linear point-interpolation scheme, which had to use all extreme points of the belief
simplex.

Figure 25: Improvements in the bound quality for the incremental linear-function method
and four different grid-selection heuristics. Each cycle includes 40 grid-point
updates.

The relatively small improvement of our heuristics can be explained by the fact that 
every new linear function influences a larger portion of the belief space and thus the method 
should be less sensitive to a choice of a specific point. However, another plausible explana- 
tion is that our heuristics were not very good and more accurate heuristics or combinations 
of heuristics could be constructed. Efficient strategies for locating grid points used in some 
of the exact methods, e.g. the Witness algorithm (Kaelbling et al., 1999) or Cheng's meth- 
ods (Cheng, 1988) can potentially be applied to this problem. This remains an open area 
of research. 

4.7.4 Control 

The grid-based linear-function approach leads to a piecewise linear and convex approxi- 
mation. Every linear function comes with a natural action choice that lets us choose the 
action greedily. Thus we can run both the lookahead and the direct controllers. Figure 26 
compares the performance of four different controllers for the fixed grid of 40 points, com- 
bining standard and incremental updates with lookahead and direct greedy control after 1, 
5 and 10 update cycles. The results (see also Figure 24) illustrate the trade-offs between the 
computational time of obtaining the solution and its quality. We see that the incremental 
approach and the lookahead controller design tend to improve the control performance. The 
prices paid are worse running and reaction times, respectively. 

27. The small sensitivity of the incremental method to the selection of grid points would suggest that one 
could, in many instances, replace exact updates with simpler point selection strategies. This could 
increase the speed of exact value-iteration methods (at least in their initial stages), which suffer from 
inefficiencies associated with locating a complete set of grid points to be updated in every step. However, 
this issue needs to be investigated.

Figure 26: Control performance of four different controllers based on grid-based linear function updates after 1, 5 and 10 update cycles for the same 40-point grid. Controllers represent combinations of two update strategies (standard and incremental) and two action-extraction techniques (direct and lookahead). Running
times for the two update strategies were presented in Figure 24. For comparison we include also performances of the QMDP and the fast informed bound
methods (with both direct and lookahead designs).

Figure 27: Control performances of lookahead controllers based on the incremental linear-function approach and different point-selection heuristics after 1, 5 and 10 improvement cycles. For comparison, scores for the QMDP and the fast informed
bound approximations are shown as well.



Figure 27 illustrates the effect of point selection heuristics on control. We compare the
results for lookahead control only, using approximations obtained after 1, 5 and 10 improvement cycles (each cycle consists of 40 grid point updates). The test results show that, as
for the bound quality, there are no big differences among various heuristics, suggesting a
small sensitivity of control to the selection of grid points.

4.8 Summary of Value-Function Approximations 

Heuristic value-function approximation methods allow us to replace hard-to-compute exact
methods and trade off solution quality for speed. There are numerous methods we can em- 
ploy, each with different properties and different trade-offs of quality versus speed. Tables 1 
and 2 summarize main theoretical properties of the approximation methods covered in this 
paper. The majority of these methods are of polynomial complexity or at least have effi- 
cient (polynomial) Bellman updates. This makes them good candidates for more complex 
POMDP problems that are out of reach of exact methods. 

All of the methods are heuristic approximations in that they do not give solutions of a 
guaranteed precision. Despite this fact we proved that solutions of some of the methods are 
no worse than others in terms of value function quality (see Figure 15). This was one of the 
main contributions of the paper. However, there are currently minimal theoretical results 
relating these methods in terms of control performance; the exceptions are some results
for FSM-controllers and FSM-based approximations. The key observation here is that for 
the quality of control (lookahead control) it is more important to approximate the shape 
(derivatives) of the value function correctly. This is also illustrated empirically on grid- 
based interpolation-extrapolation methods in Section 4.5.9 that are based on non-convex 
value functions. The main challenges here are to find ways of analyzing and comparing 
control performance of different approximations also theoretically and to identify classes of 
POMDPs for which certain methods dominate the others. 

Finally, we note that the list of methods is not complete and other value-function
approximation methods or refinements of existing methods are possible. For example, White
and Scherer (1994) investigate methods based on truncated histories that lead to upper
and lower bound estimates of the value function for complete information states (complete
histories). Also, additional restrictions on some of the methods can change the properties
of a more generic method. For example, it is possible that under additional assumptions
we will be able to ensure convergence of the least-squares fit approximation.

5. Conclusions 

POMDPs offer an elegant mathematical framework for representing decision processes
in stochastic partially observable domains. Despite their modeling advantages, however,
POMDP problems are hard to solve exactly. Thus, the complexity of problem-solving
procedures becomes the key aspect in the successful application of the model to real-world
problems, even at the expense of optimality. As recent complexity results for the
approximability of POMDP problems are not encouraging (Lusena et al., 1998; Madani
et al., 1999), we focus on heuristic approximations, in particular approximations of value
functions.

Grid-based linear function method                 lower
Incremental version (start from a lower bound)    lower    *

Table 1: Properties of different value-function approximation methods: bound property,
isotonicity and contraction property of the underlying mappings for $0 \le \gamma < 1$.
(*) Although the incremental version of the grid-based linear-function method is not
a contraction, it always converges.



Method                                      Finite-horizon    Discounted infinite-horizon
MDP approximation                           P                 P
QMDP approximation                          P                 P
Fast informed bound                         P                 P
UMDP approximation                          NP-hard           undecidable
Fixed-strategy method                       P                 P
Grid-based interpolation-extrapolation      varies            NA
  Nearest neighbor                          P                 P
  Kernel regression                         P                 P
  Linear point interpolation                P                 varies
  Fixed interpolation                       P                 P
  Best interpolation                        P                 ?
Curve-fitting (least-squares fit)           varies            NA
  Linear Q-function                         P                 NA
Grid-based linear function method           P                 NA
  Incremental version                       NA                ?

Table 2: Complexity of value-function approximation methods for the finite-horizon problem
and the discounted infinite-horizon problem. The objective for the discounted
infinite-horizon case is to find the corresponding fixed-point solution. The complexity
results take into account, in addition to components of POMDPs, all other
approximation-specific parameters, e.g., the size of the grid G in grid-based methods.
? indicates open instances and NA methods that are not applicable to one
of the problems (e.g., because of possible divergence).
5.1 Contributions

The paper surveys new and known value-function approximation methods for solving POMDPs. 
We focus primarily on the theoretical analysis and comparison of the methods, with find- 
ings and results supported experimentally on a problem of moderate size from the agent 
navigation domain. We analyze the methods from different perspectives: their computa- 
tional complexity, capability to bound the optimal value function, convergence properties of 
iterative implementations, and the quality of derived controllers. The analysis includes new 
theoretical results, deriving the properties of individual approximations, and their relations 
to exact methods. In general, the relations between and trade-offs among different methods
are not well understood. We provide some new insights on these issues by analyzing their
corresponding updates. For example, we showed that the differences among the exact, the
MDP, the QMDP, the fast informed bound, and the UMDP methods boil down to simple
mathematical manipulations and their subsequent effect on the value-function
approximation. This allowed us to determine relations among different methods in terms of
the quality of their respective value functions, which is one of the main results of the paper.

We also presented a number of new methods and heuristic refinements of some existing
techniques. The primary contributions in this area include the fast informed bound,
grid-based point interpolation methods (including adaptive grid approaches based on
stochastic sampling), and the incremental linear-function method. We also showed that in
some instances the solutions can be obtained more efficiently by converting the original
approximation into an equivalent finite-state MDP. For example, grid-based approximations
with convex rules can often be solved via conversion into a grid-based MDP (in which grid
points correspond to new states), leading to a polynomial-complexity algorithm for both the
finite-horizon and the discounted infinite-horizon cases (Section 4.5.3). This result can
dramatically improve the run-time performance of the grid-based approaches. A similar
conversion to the equivalent finite-state MDP, allowing a polynomial-time solution for the
discounted infinite-horizon problem, was shown for the fast informed bound method (Section 4.2).

5.2 Challenges and Future Directions 

Work on POMDPs and their approximations is far from complete. Some complexity results 
remain open, in particular, the complexity of the grid-based approach seeking the best in- 
terpolation, or the complexity of finding the fixed-point solution for the incremental version 
of the grid-based linear-function method. Another interesting issue that needs more inves- 
tigation is the convergence of value iteration with least-squares approximation. Although 
the method can be unstable in the general case, it is possible that under certain restrictions 
it will converge. 

In the paper we use a single POMDP problem (Maze20) only to support theoretical 
findings or to illustrate some intuitions. Therefore, the results not supported theoreti- 
cally (related mostly to control) cannot be generalized and used to rank different methods, 
since their performance may vary on other problems. In general, the area of POMDPs 
and POMDP approximations suffers from a shortage of larger-scale experimental work with 
multiple problems of different complexities and a broad range of methods. Experimental 
work is especially needed to study and compare different methods with regard to control
quality. The main reason for this is that there are only a few theoretical results relating
the methods in terms of control performance. These studies should help focus theoretical
exploration by discovering interesting cases and possibly identifying classes of problems
for which certain approximations are more or less suitable. Our preliminary experimental
results show that there are significant differences in control performance among different
methods and that not all of them may be suitable for approximating control policies. For
example, the grid-based nearest-neighbor approach with piecewise-constant approximation is
typically outperformed by other, simpler (and more efficient) value-function methods.

The present work focused on heuristic approximation methods. We investigated gen- 
eral (flat) POMDPs and did not take advantage of any additional structural refinements. 
However, real-world problems usually offer more structure that can be exploited to devise 
new algorithms and perhaps lead to further speed-ups. It is also possible that some of the
restricted versions of POMDPs (with additional structural assumptions) can be solved or
approximated efficiently, even though the general complexity results for POMDPs or their
ε-approximations are not very encouraging (Papadimitriou & Tsitsiklis, 1987; Littman, 1996;
Mundhenk et al., 1997; Lusena et al., 1998; Madani et al., 1999). A challenge here is to
identify models that allow efficient solutions and are at the same time interesting enough
from the point of view of applications.

Finally, a number of interesting issues arise when we move to problems with large state, 
action, and observation spaces. Here, the complexity of not only value-function updates 
but also belief state updates becomes an issue. In general, partial observability of hidden 
process states does not allow us to factor and decompose belief states (and their updates), 
even when transitions have a great deal of structure and can be represented very compactly. 
Promising directions to deal with these issues include various Monte-Carlo approaches (Isard
& Blake, 1996; Kanazawa, Koller, & Russell, 1995; Doucet, 1998; Kearns et al., 1999),
methods for approximating belief states via decomposition (Boyen & Koller, 1998, 1999),
or a combination of the two approaches (McAllester & Singh, 1999).
Appendix A. Theorems and proofs

A.1 Convergence to the Bound

Theorem 6 Let $H_1$ and $H_2$ be two value-function mappings defined on $\mathcal{V}_1$ and
$\mathcal{V}_2$ s.t.

1. $H_1$, $H_2$ are contractions with fixed points $V_1^*$, $V_2^*$;

2. $V_1^* \in \mathcal{V}_2$ and $H_2 V_1^* \ge H_1 V_1^* = V_1^*$;

3. $H_2$ is an isotone mapping.

Then $V_2^* \ge V_1^*$ holds.

Proof By applying $H_2$ to condition 2 and expanding the result with condition 2 again we
get: $H_2^2 V_1^* \ge H_2 V_1^* \ge H_1 V_1^* = V_1^*$. Repeating this we get in the limit
$V_2^* \ge \cdots \ge H_2^n V_1^* \ge \cdots \ge H_2^2 V_1^* \ge H_2 V_1^* \ge H_1 V_1^* = V_1^*$,
which proves the result. □
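
Theorem 6 can be checked numerically on a toy instance. The following is a minimal sketch under stated assumptions, not part of the original analysis: the random MDP and the names `fixed_point`, `h1` and `h2` are hypothetical. Here `h1` evaluates a fixed policy (always action 0) and `h2` is the Bellman optimality mapping; both are contractions, `h2` is isotone, and `h2(v) >= h1(v)` holds for every `v`, so the three conditions of the theorem are satisfied.

```python
import numpy as np


def fixed_point(h, v0, tol=1e-10, max_iter=10_000):
    """Iterate a contraction mapping h until successive iterates differ by < tol."""
    v = v0
    for _ in range(max_iter):
        v_next = h(v)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v


# Hypothetical toy MDP (random rewards and transitions), used only for illustration.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
reward = rng.uniform(0.0, 1.0, size=(n_states, n_actions))            # rho(s, a)
trans = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s' | s, a)


def q_values(v):
    # Q(s, a) = rho(s, a) + gamma * sum_{s'} P(s' | s, a) v(s')
    return reward + gamma * trans @ v


def h1(v):
    """Mapping for the fixed policy that always takes action 0."""
    return q_values(v)[:, 0]


def h2(v):
    """Bellman optimality mapping (maximization over all actions)."""
    return q_values(v).max(axis=1)


v1_star = fixed_point(h1, np.zeros(n_states))
v2_star = fixed_point(h2, np.zeros(n_states))
```

In this sketch the fixed point of `h2` dominates the fixed point of `h1` componentwise, as the theorem predicts.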

A.2 Accuracy of a Lookahead Controller Based on Bounds

Theorem 7 Let $V_U$ and $V_L$ be upper and lower bounds of the optimal value function for
the discounted infinite-horizon problem. Let
$\epsilon = \sup_b |V_U(b) - V_L(b)| = \|V_U - V_L\|$ be the maximum bound difference. Then
the expected reward $V^{LA}$ for a lookahead controller, constructed for either $V_U$ or
$V_L$, satisfies $\|V^{LA} - V^*\| \le \frac{\epsilon(2-\gamma)}{(1-\gamma)}$.

Proof Let $\hat{V}$ denote either an upper or lower bound approximation of $V^*$ and let
$H^{LA}$ be the value-function mapping corresponding to the lookahead policy for $\hat{V}$.
Note that since the lookahead policy always optimizes its actions with regard to $\hat{V}$,
$H\hat{V} = H^{LA}\hat{V}$ must hold. The error of $V^{LA}$ can be bounded using the
triangle inequality:

$$\|V^{LA} - V^*\| \le \|V^{LA} - \hat{V}\| + \|\hat{V} - V^*\|.$$

The first component satisfies:

$$\begin{aligned}
\|V^{LA} - \hat{V}\| &= \|H^{LA}V^{LA} - \hat{V}\| \\
&\le \|H^{LA}V^{LA} - H\hat{V}\| + \|H\hat{V} - \hat{V}\| \\
&= \|H^{LA}V^{LA} - H^{LA}\hat{V}\| + \|H\hat{V} - \hat{V}\| \\
&\le \gamma\|V^{LA} - \hat{V}\| + \epsilon.
\end{aligned}$$

The inequality $\|H\hat{V} - \hat{V}\| \le \epsilon$ follows from the isotonicity of $H$ and
the fact that $\hat{V}$ is either an upper or a lower bound. Rearranging the inequalities,
we obtain $\|V^{LA} - \hat{V}\| \le \frac{\epsilon}{(1-\gamma)}$. The bound on the second
term, $\|\hat{V} - V^*\| \le \epsilon$, is trivial.

Therefore, $\|V^{LA} - V^*\| \le \epsilon\left[\frac{1}{(1-\gamma)} + 1\right] = \frac{\epsilon(2-\gamma)}{(1-\gamma)}$. □
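
To get a feel for the bound in Theorem 7, the short sketch below evaluates it for given values of the bound gap and the discount factor. The function name `lookahead_error_bound` is a hypothetical helper introduced only for illustration.

```python
def lookahead_error_bound(eps: float, gamma: float) -> float:
    """Evaluate eps * (2 - gamma) / (1 - gamma), the bound from Theorem 7.

    eps   -- maximum difference between the upper and the lower bound
    gamma -- discount factor, assumed to satisfy 0 <= gamma < 1
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    if eps < 0.0:
        raise ValueError("eps must be non-negative")
    return eps * (2.0 - gamma) / (1.0 - gamma)
```

For example, a bound gap of 0.5 with a discount factor of 0.95 gives a guarantee of 0.5 · 1.05 / 0.05 = 10.5, which shows how quickly the guarantee loosens as the discount factor approaches 1.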



A.3 MDP, QMDP and the Fast Informed Bounds

Theorem 8 A solution for the fast informed bound approximation can be found by solving
an MDP with $|S||A||\Theta|$ states, $|A|$ actions and the same discount factor $\gamma$.

Proof Let $\alpha_i^a$ be a linear function for action $a$ defining $\hat{V}_i$. Let
$\alpha_i(s,a)$ denote parameters of the function. The parameters of $\hat{V}_{i+1}$ satisfy:

$$\alpha_{i+1}(s,a) = \rho(s,a) + \gamma \sum_{o \in \Theta} \max_{a' \in A} \sum_{s' \in S} P(s',o|s,a)\,\alpha_i(s',a').$$

Let

$$\alpha_{i+1}(s,a,o) = \max_{a' \in A} \sum_{s' \in S} P(s',o|s,a)\,\alpha_i(s',a').$$

Now, we can rewrite $\alpha_{i+1}(s,a,o)$ for every $s,a,o$ as:

$$\begin{aligned}
\alpha_{i+1}(s,a,o) &= \max_{a' \in A} \sum_{s' \in S} P(s',o|s,a)\left[\rho(s',a') + \gamma \sum_{o' \in \Theta} \alpha_i(s',a',o')\right] \\
&= \max_{a' \in A} \left\{\left[\sum_{s' \in S} P(s',o|s,a)\,\rho(s',a')\right] + \gamma\left[\sum_{o' \in \Theta}\sum_{s' \in S} P(s',o|s,a)\,\alpha_i(s',a',o')\right]\right\}.
\end{aligned}$$

These equations define an MDP with state space $S \times A \times \Theta$, action space $A$
and discount factor $\gamma$. Thus, a solution for the fast informed bound update can be
found by solving an equivalent finite-state MDP. □
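
The fast informed bound update used in the proof of Theorem 8 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the array layout and the function names `fib_update` and `solve_fib` are hypothetical, and the sketch iterates the update to its fixed point directly instead of constructing the equivalent MDP explicitly.

```python
import numpy as np


def fib_update(alpha, reward, joint, gamma):
    """One fast informed bound update of the parameters alpha_i(s, a).

    alpha  -- array of shape (S, A), alpha_i(s, a)
    reward -- array of shape (S, A), rho(s, a)
    joint  -- array of shape (S, A, O, S), joint[s, a, o, s2] = P(s2, o | s, a)
    gamma  -- discount factor
    """
    # backed[s, a, o, a2] = sum_{s2} P(s2, o | s, a) * alpha_i(s2, a2)
    backed = np.einsum("saop,pb->saob", joint, alpha)
    # Maximize over the next action a2, then sum over observations o.
    return reward + gamma * backed.max(axis=3).sum(axis=2)


def solve_fib(reward, joint, gamma, tol=1e-9, max_iter=100_000):
    """Iterate fib_update from zero until the parameters stop changing."""
    alpha = np.zeros_like(reward)
    for _ in range(max_iter):
        new_alpha = fib_update(alpha, reward, joint, gamma)
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```

Each update costs on the order of $|S|^2|A|^2|\Theta|$ operations in this layout, which is consistent with the polynomial Bellman update discussed for this method.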



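Theorem 9 Let $\hat{V}_i$ correspond to a piecewise linear convex value function defined by
a set $\Gamma_i$ of linear functions. Then
$H\hat{V}_i \le H_{FIB}\hat{V}_i \le H_{QMDP}\hat{V}_i \le H_{MDP}\hat{V}_i$.

Proof

$$\begin{aligned}
(H\hat{V}_i)(b) &= \max_{a \in A}\left\{\sum_{s \in S}\rho(s,a)\,b(s) + \gamma\sum_{o \in \Theta}\max_{\alpha_i \in \Gamma_i}\sum_{s' \in S}\sum_{s \in S}P(s',o|s,a)\,b(s)\,\alpha_i(s')\right\} \\
&\le \max_{a \in A}\sum_{s \in S}b(s)\left[\rho(s,a) + \gamma\sum_{o \in \Theta}\max_{\alpha_i \in \Gamma_i}\sum_{s' \in S}P(s',o|s,a)\,\alpha_i(s')\right] \\
&= (H_{FIB}\hat{V}_i)(b) \\
&\le \max_{a \in A}\sum_{s \in S}b(s)\left[\rho(s,a) + \gamma\sum_{s' \in S}P(s'|s,a)\max_{\alpha_i \in \Gamma_i}\alpha_i(s')\right] \\
&= (H_{QMDP}\hat{V}_i)(b) \\
&\le \sum_{s \in S}b(s)\max_{a \in A}\left[\rho(s,a) + \gamma\sum_{s' \in S}P(s'|s,a)\max_{\alpha_i \in \Gamma_i}\alpha_i(s')\right] \\
&= (H_{MDP}\hat{V}_i)(b). \qquad □
\end{aligned}$$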
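
The ordering in Theorem 9 comes from moving the maximization over linear functions and actions progressively further inside the sums. The sketch below computes the four one-step backups at a single belief so the ordering can be checked numerically. It is a minimal illustration under stated assumptions: the array layout and the function name `backups` are hypothetical.

```python
import numpy as np


def backups(b, gamma_set, reward, joint, gamma):
    """Return the (exact, FIB, QMDP, MDP) backups of V(b) = max_alpha alpha . b.

    b         -- belief, shape (S,)
    gamma_set -- linear functions (alpha vectors), shape (K, S)
    reward    -- rho(s, a), shape (S, A)
    joint     -- joint[s, a, o, s2] = P(s2, o | s, a), shape (S, A, O, S)
    gamma     -- discount factor
    """
    # g[k, s, a, o] = sum_{s2} P(s2, o | s, a) * alpha_k(s2)
    g = np.einsum("saop,kp->ksao", joint, gamma_set)

    # Exact update: the max over alpha is taken after averaging over the belief.
    exact_future = np.einsum("s,ksao->kao", b, g).max(axis=0).sum(axis=1)
    exact = np.max(b @ reward + gamma * exact_future)

    # Fast informed bound: the max over alpha is taken separately for each state s.
    fib_future = g.max(axis=0).sum(axis=2)
    fib = np.max(b @ (reward + gamma * fib_future))

    # QMDP: the max over alpha is taken separately for each next state s2.
    trans = joint.sum(axis=2)                 # P(s2 | s, a)
    q = reward + gamma * trans @ gamma_set.max(axis=0)
    qmdp = np.max(b @ q)

    # MDP: additionally, the max over actions is taken separately for each state s.
    mdp = b @ q.max(axis=1)
    return exact, fib, qmdp, mdp
```

On randomly generated models the returned values are non-decreasing from left to right, matching the chain of inequalities in the proof.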



A.4 Fixed-Strategy Approximations

Theorem 10 Let $C_{FSM}$ be an FSM controller. Let $C_{DR}$ and $C_{LA}$ be the direct and
the one-step-lookahead controllers constructed based on $C_{FSM}$. Then
$V^{C_{FSM}}(b) \le V^{C_{DR}}(b)$ and $V^{C_{FSM}}(b) \le V^{C_{LA}}(b)$ hold for all
belief states $b \in \mathcal{I}$.

Proof The value function for the FSM controller $C_{FSM}$ satisfies:

$$V^{C_{FSM}}(b) = \max_{x \in M} V(x,b) = V(\psi(b),b)$$

where

$$V(x,b) = \rho(b,\eta(x)) + \gamma\sum_{o \in \Theta}P(o|b,\eta(x))\,V(\phi(x,o),\tau(b,\eta(x),o)).$$

The direct controller $C_{DR}$ selects the action greedily in every step, that is, it always
chooses according to $\psi(b) = \arg\max_{x \in M} V(x,b)$. The lookahead controller $C_{LA}$
selects the action based on $V(x,b)$ one step away:

$$\eta^{LA}(b) = \arg\max_{a \in A}\left[\rho(b,a) + \gamma\sum_{o \in \Theta}P(o|b,a)\max_{x' \in M}V(x',\tau(b,a,o))\right].$$

By expanding the value function for $C_{FSM}$ for one step we get:

\begin{align}
V^{C_{FSM}}(b) &= \max_{x \in M}V(x,b) \notag \\
&= \max_{x \in M}\left[\rho(b,\eta(x)) + \gamma\sum_{o \in \Theta}P(o|b,\eta(x))\,V(\phi(x,o),\tau(b,\eta(x),o))\right] \tag{1} \\
&\le \rho(b,\eta(\psi(b))) + \gamma\sum_{o \in \Theta}P(o|b,\eta(\psi(b)))\max_{x' \in M}V(x',\tau(b,\eta(\psi(b)),o)) \tag{2} \\
&\le \max_{a \in A}\left[\rho(b,a) + \gamma\sum_{o \in \Theta}P(o|b,a)\max_{x' \in M}V(x',\tau(b,a,o))\right] \tag{3}
\end{align}

Iteratively expanding $\max_{x' \in M}V(x',\cdot)$ in 2 and 3 with expression 1 and
substituting the improved (higher-value) expressions 2 and 3 back, we obtain value functions
for both the direct and the lookahead controllers. (Expansions of 2 lead to the value for
the direct controller and expansions of 3 to the value for the lookahead controller.) Thus
$V^{C_{FSM}} \le V^{C_{DR}}$ and $V^{C_{FSM}} \le V^{C_{LA}}$ must hold. Note, however,
that the action choices $\eta(\psi(b))$ and $\eta^{LA}(b)$ in expressions 2 and 3 can be
different, leading to different next-step belief states and subsequently to different
expansion sequences. Therefore, the above result does not imply that
$V^{C_{DR}}(b) \le V^{C_{LA}}(b)$ for all $b \in \mathcal{I}$. □
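
The two action-selection rules compared in Theorem 10 can be sketched as follows. This is a minimal illustration under stated assumptions: each memory state of the FSM controller is assumed to be represented by a linear function over states, so that the value of memory state x at belief b is the dot product of its vector with b, and the names `direct_action`, `lookahead_action`, `alphas` and `eta` are hypothetical.

```python
import numpy as np


def direct_action(b, alphas, eta):
    """Direct controller: pick the best memory state for b and take its action.

    b      -- belief, shape (S,)
    alphas -- alphas[x, s] is the value of memory state x in state s, shape (X, S)
    eta    -- eta[x] is the action attached to memory state x, shape (X,)
    """
    return int(eta[np.argmax(alphas @ b)])


def lookahead_action(b, alphas, reward, joint, gamma):
    """Lookahead controller: maximize the one-step expansion over all actions.

    reward -- rho(s, a), shape (S, A)
    joint  -- joint[s, a, o, s2] = P(s2, o | s, a), shape (S, A, O, S)
    """
    # unnorm[a, o, s2] = sum_s P(s2, o | s, a) b(s) = P(o | b, a) * tau(b, a, o)(s2)
    unnorm = np.einsum("s,saop->aop", b, joint)
    # Because each memory-state value is linear in the belief, the factor P(o | b, a)
    # can stay inside the max, so the next belief never has to be normalized.
    future = np.einsum("aop,xp->aox", unnorm, alphas).max(axis=2).sum(axis=1)
    return int(np.argmax(b @ reward + gamma * future))
```

The two rules can return different actions for the same belief, which is why the proof tracks their expansion sequences separately.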



A.5 Grid-Based Linear-Function Method

Theorem 14 Let $\hat{V}_i$ be a value function obtained via the incremental linear-function
method, starting from $\hat{V}_0$, which corresponds to some fixed strategy $C_0$. Let
$C_{LA,i}$ and $C_{DR,i}$ be two controllers based on $\hat{V}_i$: the lookahead controller
and the direct action controller, and $V^{C_{LA,i}}$, $V^{C_{DR,i}}$ be their respective
value functions. Then $\hat{V}_i \le V^{C_{LA,i}}$ and $\hat{V}_i \le V^{C_{DR,i}}$ hold.

Proof By initializing the method with a value function for some FSM controller $C_0$, the
incremental updates can be interpreted as additions of new states to the FSM controller (a
new linear function corresponds to a new state of the FSM). Let $C_{FSM,i}$ be the
controller after step $i$. Then $V^{C_{FSM,i}} = \hat{V}_i$ holds and the inequalities
follow from Theorem 10. □